Summary: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains—homologous over-extension (HOE). PSI-Search combines an optimal Smith–Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search is also produces 2- for 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive.
Availability and implementation: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).
Owing to high evolutionary divergence, it is not always possible to identify distantly related protein domains by sequence search techniques. Intermediate sequences possess sequence features of more than one protein and facilitate detection of remotely related proteins. We have demonstrated recently the employment of Cascade PSI-BLAST where we perform PSI-BLAST for many ‘generations’, initiating searches from new homologues as well. Such a rigorous propagation through generations of PSI-BLAST employs effectively the role of intermediates in detecting distant similarities between proteins. This approach has been tested on a large number of folds and its performance in detecting superfamily level relationships is ∼35% better than simple PSI-BLAST searches. We present a web server for this search method that permits users to perform Cascade PSI-BLAST searches against the Pfam, SCOP and SwissProt databases. The URL for this server is .
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a widely popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Motivation: Homologous protein families share highly conserved sequence and structure regions that are frequent targets for comparative analysis of related proteins and families. Many protein families, such as the curated domain families in the Conserved Domain Database (CDD), exhibit similar structural cores. To improve accuracy in aligning such protein families, we propose a profile–profile method CORAL that aligns individual core regions as gap-free units.
Results: CORAL computes optimal local alignment of two profiles with heuristics to preserve continuity within core regions. We benchmarked its performance on curated domains in CDD, which have pre-defined core regions, against COMPASS, HHalign and PSI-BLAST, using structure superpositions and comprehensive curator-optimized alignments as standards of truth. CORAL improves alignment accuracy on core regions over general profile methods, returning a balanced score of 0.57 for over 80% of all domain families in CDD, compared with the highest balanced score of 0.45 from other methods. Further, CORAL provides E-values to aid in detecting homologous protein families and, by respecting block boundaries, produces alignments with improved ‘readability’ that facilitate manual refinement.
Availability: CORAL will be included in future versions of the NCBI Cn3D/CDTree software, which can be downloaded at http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml.
Supplementary information: Supplementary data are available at Bioinformatics online.
Protein sequence alignment is essential for template-based protein structure prediction and function annotation. We collect 20 sequence alignment algorithms, 10 published and 10 newly developed, which cover all representative sequence- and profile-based alignment approaches. These algorithms are benchmarked on 538 non-redundant proteins for protein fold-recognition on a uniform template library. Results demonstrate dominant advantage of profile-profile based methods, which generate models with average TM-score 26.5% higher than sequence-profile methods and 49.8% higher than sequence-sequence alignment methods. There is no obvious difference in results between methods with profiles generated from PSI-BLAST PSSM matrix and hidden Markov models. Accuracy of profile-profile alignments can be further improved by 9.6% or 21.4% when predicted or native structure features are incorporated. Nevertheless, TM-scores from profile-profile methods including experimental structural features are still 37.1% lower than that from TM-align, demonstrating that the fold-recognition problem cannot be solved solely by improving accuracy of structure feature predictions.
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.
We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database (“NR”), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.
FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.
The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.
This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal shorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localization are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the perturbational discriminant analysis (PDA).
Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best property of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggest that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We advocate that the method will be important for biologists to conduct large-scale protein annotation or for bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments.
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's; retrieval accuracy is now employed by default.
The Pcons.net Meta Server (http://pcons.net) provides improved automated tools for protein structure prediction and analysis using consensus. It essentially implements all the steps necessary to produce a high quality model of a protein. The whole process is fully automated and a potential user only submits the protein sequence. For PSI-BLAST detectable targets, an accurate model is generated within minutes of submission. For more difficult targets the sequence is automatically submitted to publicly available fold-recognition servers that use more advanced approaches to find distant structural homologs. The results from these servers are analyzed and assessed for structural correctness using Pcons and ProQ; and the user is presented with a ranked list of possible models. In addition, if the protein sequence contains more than one domain, these are automatically parsed out and resubmitted to the server as individual queries.
Machine learning-based methods have been proven to be powerful in developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) Protein Science, 14: 431-444], a machine learning-based method called DescFold was established by using Support Vector Machines (SVMs) to combine the following four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast e-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on the improvement of DescFold by incorporating more powerful descriptors and setting up a user-friendly web server.
In seeking more powerful descriptors, the profile-profile alignment score generated from the COMPASS algorithm was first considered as a new descriptor (i.e., PPA). When considering a profile-profile alignment between two proteins in the context of fold recognition, one protein is regarded as a template (i.e., its 3D structure is known). Instead of a sequence profile derived from a Psi-blast search, a structure-seeded profile for the template protein was generated by searching its structural neighbors with the assistance of the TM-align structural alignment algorithm. Moreover, the COMPASS algorithm was used again to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). We trained and tested the new DescFold in a total of 1,835 highly diverse proteins extracted from the SCOP 1.73 version. When the PPA and PSPA descriptors were introduced, the new DescFold boosts the performance of fold recognition substantially. Using the SCOP_1.73_40% dataset as the fold library, the DescFold web server based on the trained SVM models was further constructed. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins were selected from the SCOP 1.75 version. At a less than 5% false positive rate control, the new DescFold is able to correctly recognize structural homologs at the fold level for nearly 46% test proteins. Additionally, we also benchmarked the DescFold method against several well-established fold recognition algorithms through the LiveBench targets and Lindahl dataset.
The new DescFold method was intensively benchmarked to have very competitive performance compared with some well-established fold recognition methods, suggesting that it can serve as a useful tool to assist in template-based protein structure prediction. The DescFold server is freely accessible at http://188.8.131.52/DescFold/index.html.
PISCES is a database server for producing lists of sequences from the Protein Data Bank (PDB) using a number of entry- and chain-specific criteria and mutual sequence identity. Our goal in culling the PDB is to provide the longest list possible of the highest resolution structures that fulfill the sequence identity and structural quality cut-offs. The new PISCES server uses a combination of PSI-BLAST and structure-based alignments to determine sequence identities. Structure alignment produces more complete alignments and therefore more accurate sequence identities than PSI-BLAST. PISCES now allows a user to cull the PDB by-entry in addition to the standard culling by individual chains. In this scenario, a list will contain only entries that do not have a chain that has a sequence identity to any chain in any other entry in the list over the sequence identity cut-off. PISCES also provides fully annotated sequences including gene name and species. The server allows a user to cull an input list of entries or chains, so that other criteria, such as function, can be used. Results from a search on the re-engineered RCSB's site for the PDB can be entered into the PISCES server by a single click, combining the powerful searching abilities of the PDB with PISCES's utilities for sequence culling. The server's data are updated weekly. The server is available at .
Development of sensitive sequence search procedures for the detection of distant relationships between proteins at superfamily/fold level is still a big challenge. The intermediate sequence search approach is the most frequently employed manner of identifying remote homologues effectively. In this study, examination of serine proteases of prolyl oligopeptidase, rhomboid and subtilisin protein families were carried out using plant serine proteases as queries from two genomes including A. thaliana and O. sativa and 13 other families of unrelated folds to identify the distant homologues which could not be obtained using PSI-BLAST.
We have proposed to start with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach like Cascade PSI-BLAST. We found that classical sequence based approaches, like PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied on enriched sequence database of homologous domains and we obtained overall average coverage of 88% at family, 77% at superfamily or fold level along with specificity of ∼100% and Mathew’s correlation coefficient of 0.91. Similar approach was also implemented on 13 other protein families representing every structural class in SCOP database. Further investigation with statistical tests, like jackknifing, helped us to better understand the influence of neighbouring protein families.
Our study suggests that employment of multiple queries of a family for the Cascade PSI-BLAST searches is useful for predicting distant relationships effectively even at superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of sequences as query and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.
Protein sequence alignments have become indispensable for virtually any evolutionary, structural or functional study involving proteins. Modern sequence search and comparison methods combined with rapidly increasing sequence data often can reliably match even distantly related proteins that share little sequence similarity. However, even highly significant matches generally may have incorrectly aligned regions. Therefore when exact residue correspondence is used to transfer biological information from one aligned sequence to another, it is critical to know which alignment regions are reliable and which may contain alignment errors.
PSI-BLAST-ISS is a standalone Unix-based tool designed to delineate reliable regions of sequence alignments as well as to suggest potential variants in unreliable regions. The region-specific reliability is assessed by producing multiple sequence alignments in different sequence contexts followed by the analysis of the consistency of alignment variants. The PSI-BLAST-ISS output enables the user to simultaneously analyze alignment reliability between query and multiple homologous sequences. In addition, PSI-BLAST-ISS can be used to detect distantly related homologous proteins. The software is freely available at: .
PSI-BLAST-ISS is an effective reliability assessment tool that can be useful in applications such as comparative modelling or analysis of individual sequence regions. It favorably compares with the existing similar software both in the performance and functional features.
BLAST Filter and GraphAlign are web-based tools that offer novel methods for building and analyzing sets of related (i.e. similar) DNA and protein sequences. They can be used separately or together. BLAST Filter generates related sequence sets in an automated, objective and reproducible way based on an input query sequence. Sequences matched by BLAST are filtered through a set of 15 user-configurable rules based on full-length query/subject comparisons, high-scoring segment pair statistics and the level of redundancy in the sequence set. Such sets can be used for multiple alignments, profile hidden Markov models and other bioinformatics applications, including GraphAlign, which provides several novel methods for analyzing global query/subject alignments along with graphical representations of sequence similarities. These services are available at the following URLs: http://darwin.nmsu.edu/cgi-bin/blast_filter.cgi and http://darwin.nmsu.edu/cgi-bin/graph_align.cgi.
Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone' where sequence similarity gets indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases where structural data is not available. This situation demands development of methods that extend the applicability of accurate sequence alignment to distantly related proteins.
We develop a sequence alignment method that combines the prediction of a structural profile based on the protein's sequence with the alignment of that profile using our recently published alignment tool SABERTOOTH. In particular, we predict the contact vector of protein structures using an artificial neural network based on position-specific scoring matrices generated by PSI-BLAST and align these predicted contact vectors. The resulting sequence alignments are assessed using two different tests: First, we assess the alignment quality by measuring the derived structural similarity for cases in which structures are available. In a second test, we quantify the ability of the significance score of the alignments to recognize structural and evolutionary relationships. As a benchmark we use a representative set of the SCOP (structural classification of proteins) database, with similarities ranging from closely related proteins at SCOP family level, to very distantly related proteins at SCOP fold level. Comparing these results with some prominent sequence alignment tools, we find that SABERTOOTH produces sequence alignments of better quality than those of Clustal W, T-Coffee, MUSCLE, and PSI-BLAST. HHpred, one of the most sophisticated and computationally expensive tools available, outperforms our alignment algorithm at family and superfamily levels, while the use of SABERTOOTH is advantageous for alignments at fold level. Our alignment scheme will profit from future improvements of structural profiles prediction.
We present the automatic sequence alignment tool SABERTOOTH that computes pairwise sequence alignments of very high quality. SABERTOOTH is especially advantageous when applied to alignments of remotely related proteins. The source code is available at http://www.fkp.tu-darmstadt.de/sabertooth_project/, free for academic users upon request.
COMPASS is a method for homology detection and local alignment construction based on the comparison of multiple sequence alignments (MSAs). The method derives numerical profiles from given MSAs, constructs local profile-profile alignments and analytically estimates E-values for the detected similarities. Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the latest version of COMPASS, which provides (i) increased sensitivity and selectivity of homology detection; (ii) longer, more complete alignments; and (iii) faster computational speed. After submission of the query MSA or single sequence, the server performs searches versus a user-specified database. The server includes detailed and intuitive control of the search parameters. A flexible output format, structured similarly to BLAST and PSI-BLAST, provides an easy way to read and analyze the detected profile similarities. Brief help sections are available for all input parameters and output options, along with detailed documentation. To illustrate the value of this tool for protein structure-functional prediction, we present two examples of detecting distant homologs for uncharacterized protein families. Available at http://prodata.swmed.edu/compass
Motivation:Several strategies have been developed to predict the fold of a target protein sequence, most of which are based on aligning the target sequence to other sequences of known structure. Previously, we demonstrated that the consideration of protein–protein interactions significantly increases the accuracy of fold assignment compared with PSI-BLAST sequence comparisons. A drawback of our method was the low number of proteins to which a fold could be assigned. Here, we present an improved version of the method that addresses this limitation. We also compare our method to other state-of-the-art fold assignment methodologies.
Results: Our approach (ModLink+) has been tested on 3716 proteins with domain folds classified in the Structural Classification Of Proteins (SCOP) as well as known interacting partners in the Database of Interacting Proteins (DIP). For this test set, the ratio of success [positive predictive value (PPV)] on fold assignment increases from 75% for PSI-BLAST, 83% for HHSearch and 81% for PRC to >90% for ModLink+at the e-value cutoff of 10−3. Under this e-value, ModLink+can assign a fold to 30–45% of the proteins in the test set, while our previous method could cover <25%. When applied to 6384 proteins with unknown fold in the yeast proteome, ModLink+combined with PSI-BLAST assigns a fold for domains in 3738 proteins, while PSI-BLAST alone covers only 2122 proteins, HHSearch 2969 and PRC 2826 proteins, using a threshold e-value that would represent a PPV >82% for each method in the test set.
Availability: The ModLink+server is freely accessible in the World Wide Web at http://sbi.imim.es/modlink/.
Supplementary information: Supplementary data are available at Bioinformatics online.
PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups, and for the family of Filoviridae, it currently includes 52 complete genomes available in GenBank. It has been shown that BLAST-based alignment approach works better for filoviruses, and therefore is recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database, and propose their taxonomic classification.
Filoviridae; filovirus; ICTV; International Committee on Taxonomy of Viruses; National Center for Biotechnology Information; NCBI; PAirwise Sequence Comparison; PASC; virus classification; virus taxonomy
Profile hidden Markov models (HMMs) are amongst the most successful procedures for detecting remote homology between proteins. There are two popular profile HMM programs, HMMER and SAM. Little is known about their performance relative to each other and to the recently improved version of PSI-BLAST. Here we compare the two programs to each other and to non-HMM methods, to determine their relative performance and the features that are important for their success. The quality of the multiple sequence alignments used to build models was the most important factor affecting the overall performance of profile HMMs. The SAM T99 procedure is needed to produce high quality alignments automatically, and the lack of an equivalent component in HMMER makes it less complete as a package. Using the default options and parameters as would be expected of an inexpert user, it was found that from identical alignments SAM consistently produces better models than HMMER and that the relative performance of the model-scoring components varies. On average, HMMER was found to be between one and three times faster than SAM when searching databases larger than 2000 sequences, SAM being faster on smaller ones. Both methods were shown to have effective low complexity and repeat sequence masking using their null models, and the accuracy of their E-values was comparable. It was found that the SAM T99 iterative database search procedure performs better than the most recent version of PSI-BLAST, but that scoring of PSI-BLAST profiles is more than 30 times faster than scoring of SAM models.
Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences.
In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods.
The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.
One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to the sensitivity of the resulting profile. The inclusion of highly diverse sequences will presumably produce a more powerful profile, but distantly related sequences can be difficult to align accurately using only sequence information. Therefore, it would be expected that the use of protein structure alignments to improve the selection and alignment of diverse sequence homologs might yield improved profiles. However, the actual utility of such an approach has remained unclear.
We explored several iterative protocols for the generation of profile hidden Markov models. These protocols were tailored to allow the inclusion of protein structure alignments in the process, and were used for large-scale creation and benchmarking of structure alignment-enhanced models. We found that models using structure alignments did not provide an overall improvement over sequence-only models for superfamily-level structure predictions. However, the results also revealed that the structure alignment-enhanced models were complimentary to the sequence-only models, particularly at the edge of the "twilight zone". When the two sets of models were combined, they provided improved results over sequence-only models alone. In addition, we found that the beneficial effects of the structure alignment-enhanced models could not be realized if the structure-based alignments were replaced with sequence-based alignments. Our experiments with different iterative protocols for sequence-only models also suggested that simple protocol modifications were unable to yield equivalent improvements to those provided by the structure alignment-enhanced models. Finally, we found that models using structure alignments provided fold-level structure assignments that were superior to those produced by sequence-only models.
When attempting to predict the structure of remote homologs, we advocate a combined approach in which both traditional models and models incorporating structure alignments are used.
Protein structure prediction provides valuable insights into function, and comparative modeling is one of the most reliable methods to predict 3D structures directly from amino acid sequences. However, critical problems arise during the selection of the correct templates and the alignment of query sequences therewith. We have developed an automatic protein structure prediction server, (PS)2, which uses an effective consensus strategy both in template selection, which combines PSI-BLAST and IMPALA, and target–template alignment integrating PSI-BLAST, IMPALA and T-Coffee. (PS)2 was evaluated for 47 comparative modeling targets in CASP6 (Critical Assessment of Techniques for Protein Structure Prediction). For the benchmark dataset, the predictive performance of (PS)2, based on the mean GTD_TS score, was superior to 10 other automatic servers. Our method is based solely on the consensus sequence and thus is considerably faster than other methods that rely on the additional structural consensus of templates. Our results show that (PS)2, coupled with suitable consensus strategies and a new similarity score, can significantly improve structure prediction. Our approach should be useful in structure prediction and modeling. The (PS)2 is available through the website at .
Transposable elements (TEs) are major components of both prokaryotic and eukaryotic genomes and play a significant role in their evolution. In this study, we have identified new prokaryotic DDE transposase families related to the eukaryotic Mutator-like transposases. These genes were retrieved by cascade PSI-Blast using as initial query the transposase of the streptococcal integrative and conjugative element (ICE) TnGBS2. By combining secondary structure predictions and protein sequence alignments, we predicted the DDE catalytic triad and the DNA-binding domain recognizing the terminal inverted repeats. Furthermore, we systematically characterized the organization and the insertion specificity of the TEs relying on these prokaryotic Mutator-like transposases (p-MULT) for their mobility. Strikingly, two distant TE families target their integration upstream σA dependent promoters. This allowed us to identify a transposase sequence signature associated with this unique insertion specificity and to show that the dissymmetry between the two inverted repeats is responsible for the orientation of the insertion. Surprisingly, while DDE transposases are generally associated with small and simple transposons such as insertion sequences (ISs), p-MULT encoding TEs show an unprecedented diversity with several families of IS, transposons, and ICEs ranging in size from 1.1 to 52 kb.
transposase; integrative and conjugative element; insertion sequence; evolution; genome dynamics