Related Articles

1.  Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints 
BMC Bioinformatics  2006;7:400.
We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm.
We use probabilistic models (pair stochastic context-free grammars, pairSCFGs) as a unifying framework for scoring pairwise alignment and folding. A constrained version of the pairSCFG structural alignment algorithm was developed which assumes knowledge of a few confidently aligned positions (pins). These pins are selected based on the posterior probabilities of a probabilistic pairwise sequence alignment.
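The pin-selection idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the Consan implementation: the function name `select_pins`, the 0.95 threshold, and the greedy consistency filter are all assumptions introduced here, and the posterior matrix is taken as given rather than computed from a pair model.

```python
def select_pins(posterior, threshold=0.95):
    """Pick confidently aligned position pairs ("pins") from a posterior matrix.

    posterior[i][j] is assumed to hold the posterior probability that
    position i of sequence X is aligned to position j of sequence Y.
    The threshold value is a hypothetical choice for illustration.
    """
    # Collect every cell whose posterior reaches the threshold.
    candidates = [
        (p, i, j)
        for i, row in enumerate(posterior)
        for j, p in enumerate(row)
        if p >= threshold
    ]
    # Consider the most confident candidates first.
    candidates.sort(reverse=True)

    pins = []
    for _, i, j in candidates:
        # Keep a candidate only if it is collinear with all pins so far,
        # i.e. it lies strictly before or strictly after each of them.
        if all((i < pi and j < pj) or (i > pi and j > pj) for pi, pj in pins):
            pins.append((i, j))
    return sorted(pins)


toy_posterior = [
    [0.99, 0.00, 0.00],
    [0.00, 0.50, 0.30],
    [0.00, 0.00, 0.97],
]
print(select_pins(toy_posterior))
```

A structural alignment algorithm constrained by such pins only needs to explore alignments that pass through every pinned pair, which is where the reduction in runtime and memory comes from.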
Pairwise RNA structural alignment improves on structure prediction accuracy relative to single sequence folding. Constraining on alignment is a straightforward method of reducing the runtime and memory requirements of the algorithm. Five practical implementations of the pairwise Sankoff algorithm – this work (Consan), David Mathews' Dynalign, Ian Holmes' Stemloc, Ivo Hofacker's PMcomp, and Jan Gorodkin's FOLDALIGN – have comparable overall performance with different strengths and weaknesses.
PMCID: PMC1579236  PMID: 16952317
2.  SCOP: a Structural Classification of Proteins database 
Nucleic Acids Research  2000;28(1):257-259.
The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far. The sequences of proteins in SCOP provide the basis of the ASTRAL sequence libraries that can be used as a source of data to calibrate sequence search algorithms and for the generation of statistics on, or selections of, protein structures. Links can be made from SCOP to PDB-ISL: a library containing sequences homologous to proteins of known structure. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.
PMCID: PMC102479  PMID: 10592240
3.  Ancestral sequence alignment under optimal conditions 
BMC Bioinformatics  2005;6:273.
Multiple genome alignment is an important problem in bioinformatics. An important subproblem used by many multiple alignment approaches is that of aligning two multiple alignments. Many popular alignment algorithms for DNA use the sum-of-pairs heuristic, where the score of a multiple alignment is the sum of its induced pairwise alignment scores. However, the biological meaning of the sum-of-pairs heuristic is not obvious. Additionally, many algorithms based on the sum-of-pairs heuristic are complicated and slow compared to pairwise alignment algorithms.
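As a small illustration of the sum-of-pairs score defined above, the following sketch adds up the induced pairwise scores of a multiple alignment. The scoring values (match 1, mismatch -1, gap -2, gap-gap columns 0) and the function names are assumptions chosen for the example, not parameters taken from any of the programs discussed.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment (toy scoring)."""
    if a == "-" and b == "-":
        return 0  # columns that are gaps in both rows are ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the induced pairwise scores over all pairs of aligned rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for x, y in combinations(alignment, 2):
        total += sum(pair_score(a, b, match, mismatch, gap) for a, b in zip(x, y))
    return total


print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

The number of row pairs grows quadratically with the number of sequences, which is one reason the heuristic becomes expensive to optimize.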
An alternative approach to aligning alignments is to first infer ancestral sequences for each alignment, and then align the two ancestral sequences. In addition to being fast, this method has a clear biological basis that takes into account the evolution implied by an underlying phylogenetic tree.
In this study we explore the accuracy of aligning alignments by ancestral sequence alignment. We examine the use of both maximum likelihood and parsimony to infer ancestral sequences. Additionally, we investigate the effect on accuracy of allowing ambiguity in our ancestral sequences.
We use synthetic sequence data that we generate by simulating evolution on a phylogenetic tree. We use two different types of phylogenetic trees: trees with a period of rapid growth followed by a period of slow growth, and trees with a period of slow growth followed by a period of rapid growth.
We examine the alignment accuracy of four ancestral sequence reconstruction and alignment methods: parsimony, maximum likelihood, ambiguous parsimony, and ambiguous maximum likelihood. Additionally, we compare against the alignment accuracy of two sum-of-pairs algorithms: ClustalW and the heuristic of Ma, Zhang, and Wang.
We find that allowing ambiguity in ancestral sequences does not lead to better multiple alignments. Regardless of whether we use parsimony or maximum likelihood, the success of aligning ancestral sequences containing ambiguity is very sensitive to the choice of gap open cost. Surprisingly, we find that using maximum likelihood to infer ancestral sequences results in less accurate alignments than when using parsimony to infer ancestral sequences. Finally, we find that the sum-of-pairs methods produce better alignments than all of the ancestral alignment methods.
PMCID: PMC1310622  PMID: 16293191
4.  Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization 
BMC Bioinformatics  2007;8:271.
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account.
We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments.
The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested, especially as the number of input sequences increases. Our program LARA is freely available for academic purposes.
PMCID: PMC1955456  PMID: 17662141
5.  The HHpred interactive server for protein homology detection and structure prediction 
Nucleic Acids Research  2005;33(Web Server issue):W244-W248.
HHpred is a fast server for remote protein homology detection and structure prediction and is the first to implement pairwise comparison of profile hidden Markov models (HMMs). It allows searching a wide choice of databases, such as the PDB, SCOP, Pfam, SMART, COGs and CDD. It accepts a single query sequence or a multiple alignment as input. Within only a few minutes it returns the search results in a user-friendly format similar to that of PSI-BLAST. Search options include local or global alignment and scoring secondary structure similarity. HHpred can produce pairwise query-template alignments, multiple alignments of the query with a set of templates selected from the search results, as well as 3D structural models that are calculated by the MODELLER software from these alignments. A detailed help facility is available. As a demonstration, we analyze the sequence of SpoVT, a transcriptional regulator from Bacillus subtilis. HHpred is accessible online.
PMCID: PMC1160169  PMID: 15980461
6.  High-Affinity Glucose Transport in Aspergillus nidulans Is Mediated by the Products of Two Related but Differentially Expressed Genes 
PLoS ONE  2014;9(4):e94662.
Independent systems of high and low affinity effect glucose uptake in the filamentous fungus Aspergillus nidulans. Low-affinity uptake is known to be mediated by the product of the mstE gene. In the current work two genes, mstA and mstC, have been identified that encode high-affinity glucose transporter proteins. These proteins' primary structures share over 90% similarity, indicating that the corresponding genes share a common origin. Whilst the function of the paralogous proteins is little changed, they differ notably in their patterns of expression. The mstC gene is expressed during the early phases of germination and is subject to CreA-mediated carbon catabolite repression whereas mstA is expressed as a culture tends toward carbon starvation. In addition, various pieces of genetic evidence strongly support allelism of mstC and the previously described locus sorA. Overall, our data define MstC/SorA as a high-affinity glucose transporter expressed in germinating conidia, and MstA as a high-affinity glucose transporter that operates in vegetative hyphae under conditions of carbon limitation.
PMCID: PMC3994029  PMID: 24751997
7.  MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information 
Nucleic Acids Research  2006;34(16):4364-4374.
We have developed MUMMALS, a program to construct multiple protein sequence alignments using probabilistic consistency. MUMMALS improves alignment quality by using pairwise alignment hidden Markov models (HMMs) with multiple match states that describe local structural information without exploiting explicit structure predictions. Parameters for such models have been estimated from a large library of structure-based alignments. We show that (i) on remote homologs, MUMMALS achieves statistically best accuracy among several leading aligners, such as ProbCons, MAFFT and MUSCLE, although the average improvement is small, on the order of several percent; (ii) a large collection (>10 000) of automatically computed pairwise structure alignments of divergent protein domains is superior to smaller but carefully curated datasets for estimation of alignment parameters and performance tests; (iii) reference-independent evaluation of alignment quality using sequence alignment-dependent structure superpositions correlates well with reference-dependent evaluation that compares sequence-based alignments to structure-based reference alignments.
PMCID: PMC1636350  PMID: 16936316
8.  R-Coffee: a method for multiple alignment of non-coding RNA 
Nucleic Acids Research  2008;36(9):e52.
R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method. In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the RNA pairwise structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site.
PMCID: PMC2396437  PMID: 18420654
9.  Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework 
BMC Bioinformatics  2008;9:212.
Structural alignment of RNAs has become increasingly important since the discovery of functional non-coding RNAs (ncRNAs). Recent studies, mainly based on various approximations of the Sankoff algorithm, have resulted in considerable improvement in the accuracy of pairwise structural alignment. In contrast, for cases with more than two sequences, the practical merit of structural alignment remains unclear as compared to traditional sequence-based methods, although the importance of multiple structural alignment is widely recognized.
With accuracy and time complexity in mind, we took a different approach from a straightforward extension of the Sankoff algorithm to multiple alignments. As a new option of the MAFFT alignment program, we developed a multiple RNA alignment framework, X-INS-i, which builds a multiple alignment with an iterative method incorporating structural information through two components: (1) pairwise structural alignments by an external pairwise alignment method such as SCARNA or LaRA and (2) a new objective function, Four-way Consistency, derived from the base-pairing probability of every sub-aligned group at every multiple alignment stage.
The BRAliBASE benchmark showed that X-INS-i outperforms other methods currently available in the sum-of-pairs score (SPS) criterion. As a basis for predicting common secondary structure, the accuracy of the present method is comparable to or higher than that of the current leading methods such as RNA Sampler. The X-INS-i framework can be used for building a multiple RNA alignment from any combination of algorithms for pairwise RNA alignment and base-pairing probability. The source code is available at the webpage found in the Availability and requirements section.
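The sum-of-pairs score (SPS) used as a benchmark criterion above is commonly computed as the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. The sketch below follows that common definition; it is an assumption that this matches the exact variant used in the benchmark, and the function names are introduced here for illustration.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, ungapped position), so the
    result does not depend on where gap-only stretches are placed.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row_index, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((row_index, counters[row_index]))
                counters[row_index] += 1
        # Every pair of residues in this column counts as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
print(sps(test, reference))
```

Note that this benchmark measure compares a test alignment against a reference, which is different from the sum-of-pairs objective that some aligners optimize internally.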
PMCID: PMC2387179  PMID: 18439255
10.  SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications 
PLoS ONE  2013;8(12):e82138.
The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Although various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignments, or are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.
To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost of efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm practical for genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool.
The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.
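For readers unfamiliar with the underlying recurrence, the following is a minimal scalar sketch of Smith-Waterman scoring that keeps only two rows of the dynamic programming matrix, so memory is linear in the length of the second sequence. It is not the SSW library and uses no SIMD or striping; the linear gap penalty and the scoring values are assumptions chosen to keep the example short.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of strings a and b.

    Uses a linear gap penalty (toy parameters) and two DP rows, so only
    the score is recovered, not the alignment itself.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Recovering the full alignment normally requires either the whole matrix or an additional pass, which is the part the abstract describes handling in linear space.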
PMCID: PMC3852983  PMID: 24324759
11.  PASS2: an automated database of protein alignments organised as structural superfamilies 
BMC Bioinformatics  2004;5:35.
The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionarily related, sequentially distant proteins.
An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences with less than 40% identity among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database.
The database resolves alignments among the structural domains consisting of an evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible online.
PMCID: PMC407847  PMID: 15059245
12.  Contact-based sequence alignment 
Nucleic Acids Research  2004;32(8):2464-2473.
This paper introduces the novel method of contact-based protein sequence alignment, where structural information in the form of contact mutation probabilities is incorporated into an alignment routine using contact-mutation matrices (CAO: Contact Accepted mutatiOn). The contact-based alignment routine optimizes the score of matched contacts, which involves four (two per contact) instead of two residues per match in pairwise alignments. The first contact refers to a real side-chain contact in a template sequence with known structure, and the second contact is the equivalent putative contact of a homologous query sequence with unknown structure. An algorithm has been devised to perform a pairwise sequence alignment based on contact information. The contact scores were combined with PAM-type (Point Accepted Mutation) substitution scores after parameterization of gap penalties and score weights by means of a genetic algorithm. We show that owing to the structural information contained in the CAO matrices, significantly improved alignments of distantly related sequences can be obtained. This has allowed us to annotate eight putative Drosophila IGF sequences. Contact-based sequence alignment should therefore prove useful in comparative modelling and fold recognition.
PMCID: PMC419454  PMID: 15121903
13.  Comprehensive comparison of graph based multiple protein sequence alignment strategies 
BMC Bioinformatics  2012;13:64.
Multiple protein sequence alignment (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows swapping different components of the alignment process and, thus, investigating their contribution to the alignment quality and computation time. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark.
Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal.
This is the first time multiple protein alignment strategies have been comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available online and as a supplementary file attached to this article (see Additional file 1).
PMCID: PMC3375188  PMID: 22540977
14.  R-Coffee: a web server for accurately aligning noncoding RNA sequences 
Nucleic Acids Research  2008;36(Web Server issue):W10-W13.
The R-Coffee web server produces highly accurate multiple alignments of noncoding RNA (ncRNA) sequences, taking into account predicted secondary structures. R-Coffee uses a novel algorithm recently incorporated in the T-Coffee package. R-Coffee works along the same lines as T-Coffee: it uses pairwise or multiple sequence alignment (MSA) methods to compute a primary library of input alignments. The program then computes an MSA highly consistent with both the alignments contained in the library and the secondary structures associated with the sequences. The secondary structures are predicted using RNAplfold. The server provides two modes. The slow/accurate mode is restricted to small datasets (fewer than 5 sequences of less than 150 nucleotides each) and combines R-Coffee with Consan, a very accurate pairwise RNA alignment method. For larger datasets a fast method can be used (RM-Coffee mode), which uses R-Coffee to combine the outputs of the three programs found to perform best on RNA (MUSCLE, MAFFT and ProbConsRNA). Our BRAliBase benchmarks indicate that the R-Coffee/Consan combination is one of the best ncRNA alignment methods for short sequences, while RM-Coffee gives comparable results on longer sequences. The R-Coffee web server is available online.
PMCID: PMC2447777  PMID: 18483080
15.  PSAR: measuring multiple sequence alignment reliability by probabilistic sampling 
Nucleic Acids Research  2011;39(15):6359-6368.
Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult and error-prone problem. Therefore, it is essential to measure the reliability of the alignments and incorporate it into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method that is based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment where the chosen sequence is left out. By using simulation-based benchmarks, we find that our approach is superior to existing ones, which supports the view that suboptimal alignments are a highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation studies.
PMCID: PMC3159474  PMID: 21576232
16.  Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection 
Nucleic Acids Research  2006;34(17):e112.
The accuracy of a homology model based on the structure of a distant relative or other topologically equivalent protein is primarily limited by the quality of the alignment. Here we describe a systematic approach for sequence-to-structure alignment, called ‘K*Sync’, in which alignments are generated by dynamic programming using a scoring function that combines information on many protein features, including a novel measure of how obligate a sequence region is to the protein fold. By systematically varying the weights on the different features that contribute to the alignment score, we generate very large ensembles of diverse alignments, each optimal under a particular constellation of weights. We investigate a variety of approaches to select the best models from the ensemble, including consensus of the alignments, a hydrophobic burial measure, low- and high-resolution energy functions, and combinations of these evaluation methods. The effect on model quality and selection resulting from loop modeling and backbone optimization is also studied. The performance of the method on a benchmark set is reported and shows the approach to be effective at both generating and selecting accurate alignments. The method serves as the foundation of the homology modeling module in the Robetta server.
PMCID: PMC1635247  PMID: 16971460
17.  Protein sequence alignment with family-specific amino acid similarity matrices 
BMC Research Notes  2011;4:296.
Alignment of amino acid sequences by means of dynamic programming is a cornerstone sequence comparison method. The quality of alignments produced by dynamic programming critically depends on the choice of the alignment scoring function. Therefore, for a specific alignment problem one needs a way of selecting the best performing scoring function. This work is focused on the issue of finding optimized protein family- and fold-specific scoring functions for global similarity matrix-based sequence alignment.
I utilize a comprehensive set of reference alignments obtained from structural superposition of homologous and analogous proteins to design a quantitative statistical framework for evaluating the performance of alignment scoring functions in global pairwise sequence alignment. This framework is applied to study how existing general-purpose amino acid similarity matrices perform on individual protein families and structural folds, and to compare them to family-specific and fold-specific matrices derived in this work. I describe an adaptive alignment procedure that automatically selects an appropriate similarity matrix and optimized gap penalties based on the properties of the sequences being aligned.
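To make the dependence on the scoring function concrete, here is a minimal sketch of global (Needleman-Wunsch style) alignment scoring in which the similarity function is a parameter. The two similarity functions, the hydrophobic residue set, and the gap penalty are toy assumptions for illustration; they are not the matrices or penalties derived in the work, and no automatic matrix selection is attempted.

```python
def global_alignment_score(a, b, similarity, gap=-4):
    """Optimal global alignment score with a pluggable similarity function.

    similarity(x, y) returns the substitution score for residues x and y;
    gaps use a simple linear penalty (an assumption of this sketch).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                prev[j - 1] + similarity(a[i - 1], b[j - 1]),  # substitution
                prev[j] + gap,                                  # gap in b
                curr[j - 1] + gap,                              # gap in a
            )
        prev = curr
    return prev[-1]


def identity_similarity(x, y):
    """General-purpose toy scoring: reward identity only."""
    return 1 if x == y else -1


HYDROPHOBIC = set("AVLIMFW")  # hypothetical grouping for the toy example


def toy_family_similarity(x, y):
    """Toy 'family-specific' scoring that tolerates hydrophobic swaps."""
    if x == y:
        return 2
    if x in HYDROPHOBIC and y in HYDROPHOBIC:
        return 1
    return -2


print(global_alignment_score("LIV", "LLV", identity_similarity))
print(global_alignment_score("LIV", "LLV", toy_family_similarity))
```

The same pair of sequences receives different scores, and potentially different optimal alignments, depending on which similarity function is supplied.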
The results of this work indicate that using family-specific similarity matrices significantly improves the quality of the alignment of homologous sequences over the traditional sequence alignment based on a single general-purpose similarity matrix. However, using fold-specific similarity matrices can only marginally improve sequence alignment of proteins that share the same structural fold but do not share a common evolutionary origin. The family-specific matrices derived in this work and the optimized gap penalties are available online.
PMCID: PMC3201029  PMID: 21846354
18.  Finding the most significant common sequence and structure motifs in a set of RNA sequences. 
Nucleic Acids Research  1997;25(18):3724-3732.
We present a computational scheme to locally align a collection of RNA sequences using sequence and structure constraints. In addition, the method searches for the resulting alignments with the most significant common motifs, among all possible collections. The first part utilizes a simplified version of the Sankoff algorithm for simultaneous folding and alignment of RNA sequences, but maintains tractability by constructing multi-sequence alignments from pairwise comparisons. The algorithm finds the multiple alignments using a greedy approach and has similarities to both CLUSTAL and CONSENSUS, but the core algorithm assures that the pairwise alignments are optimized for both sequence and structure conservation. The choice of scoring system and the method of progressively constructing the final solution are important considerations that are discussed. Example solutions, and comparisons with other approaches, are provided. The solutions include finding consensus structures identical to published ones.
PMCID: PMC146942  PMID: 9278497
19.  A comprehensive system for evaluation of remote sequence similarity detection 
BMC Bioinformatics  2007;8:314.
Accurate and sensitive performance evaluation is crucial both for effective development of better structure prediction methods based on sequence similarity and for the comparative analysis of existing methods. To date, there has been no satisfactory comprehensive evaluation method that (i) is based on a large and statistically unbiased set of proteins with clearly defined relationships; and (ii) covers all performance aspects of sequence-based structure predictors, such as sensitivity and specificity, alignment accuracy and coverage, and structure template quality.
With the aim of designing such a method, we (i) select a statistically balanced set of divergent protein domains from SCOP, and define similarity relationships for the majority of these domains by complementing the best of information available in SCOP with a rigorous SVM-based algorithm; and (ii) develop protocols for the assessment of similarity detection and alignment quality from several complementary perspectives. The evaluation of similarity detection is based on ROC-like curves and includes several complementary approaches to the definition of true/false positives. Reference-dependent approaches use the 'gold standard' of pre-defined domain relationships and structure-based alignments. Reference-independent approaches assess the quality of structural match predicted by the sequence alignment, with respect to the whole domain length (global mode) or to the aligned region only (local mode). Similarly, the evaluation of alignment quality includes several reference-dependent and -independent measures, in global and local modes. As an illustration, we use our benchmark to compare the performance of several methods for the detection of remote sequence similarities, and show that different aspects of evaluation reveal different properties of the evaluated methods, highlighting their advantages, weaknesses, and potential for further development.
The presented benchmark provides a new tool for a statistically unbiased assessment of methods for remote sequence similarity detection, from various complementary perspectives. This tool should be useful both for users choosing the best method for a given purpose, and for developers designing new, more powerful methods. The benchmark set, reference alignments, and evaluation codes are available for download.
PMCID: PMC2031906  PMID: 17725841
20.  Exploring the limits of fold discrimination by structural alignment: A large scale benchmark using decoys of known fold 
Graphical abstract
► Structure alignment methods are used to assign proteins to fold groups.
► The accuracy of this procedure is difficult to test as the definition of folds can be debated.
► By defining folds topologically and building decoys with defined fold, the accuracy of fold assignments can be tested.
► Protein structure alignments of decoys with different folds can be assigned a high significance, leading to errors in fold classification.
► This observation can be extended to comparisons between decoy models and real protein structures.
Protein structure comparison by pairwise alignment is commonly used to identify highly similar substructures in pairs of proteins and to provide a measure of structural similarity based on the size and geometric similarity of the match. These scores are routinely applied in analyses of protein fold space under the assumption that high statistical significance is equivalent to a meaningful relationship; however, the truth of this assumption has previously been difficult to test, since there is a lack of automated methods that do not rely on the same underlying principles. To resolve this, we present a method based on topological descriptions of global protein structure, providing an independent means to assess the ability of structural alignment to maintain meaningful structural correspondences on a large scale.
Using a large set of decoys of specified global fold, we benchmark three widely used methods for structure comparison (SAP, TM-align and DALI) and test the degree to which this assumption is justified for these methods. Application of a topological edit distance measure to provide a scale of the degree of fold change shows that, while there is a broad correlation between high structural alignment scores and low edit distances, there remain many pairs with highly significant scores that differ by core strand swaps and are therefore structurally different on a global level. Possible causes of this problem and its meaning for present assessments of protein fold space are discussed.
PMCID: PMC3145973  PMID: 21704264
Protein structure comparison; Protein fold; Protein structure alignment; SAP; DALI; TM-align; TM-score; Decoy model
21.  Protein structure alignment beyond spatial proximity 
Scientific Reports  2013;3:1448.
Protein structure alignment is a fundamental problem in computational structure biology. Many programs have been developed for automatic protein structure alignment, but most of them align two protein structures purely on the basis of geometric similarity, without considering evolutionary and functional relationships. As such, these programs may generate structure alignments that are not very biologically meaningful from the evolutionary perspective. This paper presents a novel method, DeepAlign, for automatic pairwise protein structure alignment. DeepAlign aligns two protein structures using not only spatial proximity of equivalent residues (after rigid-body superposition), but also evolutionary relationship and hydrogen-bonding similarity. Experimental results show that DeepAlign can generate structure alignments much more consistent with manually-curated alignments than other automatic tools, especially when the proteins under consideration are remote homologs. These results imply that, in addition to geometric similarity, evolutionary information and hydrogen-bonding similarity are essential to aligning two protein structures.
PMCID: PMC3596798  PMID: 23486213
22.  Evolutionary inaccuracy of pairwise structural alignments 
Bioinformatics  2012;28(9):1209-1215.
Motivation: Structural alignment methods are widely used to generate gold standard alignments for improving multiple sequence alignments and transferring functional annotations, as well as for assigning structural distances between proteins. However, the correctness of the alignments generated by these methods is difficult to assess objectively since little is known about the exact evolutionary history of most proteins. Since homology is an equivalence relation, an upper bound on alignment quality can be found by assessing the consistency of alignments. Measuring the consistency of current methods of structure alignment and determining the causes of inconsistencies can, therefore, provide information on the quality of current methods and suggest possibilities for further improvement.
Results: We analyze the self-consistency of seven widely-used structural alignment methods (SAP, TM-align, Fr-TM-align, MAMMOTH, DALI, CE and FATCAT) on a diverse, non-redundant set of 1863 domains from the SCOP database and demonstrate that even for relatively similar proteins the degree of inconsistency of the alignments on a residue level is high (30%). We further show that levels of consistency vary substantially between methods, with two methods (SAP and Fr-TM-align) producing more consistent alignments than the rest. Inconsistency is found to be higher near gaps and for proteins of low structural complexity, as well as for helices. The ability of the methods to identify good structural alignments is also assessed using geometric measures, for which FATCAT (flexible mode) is found to be the best performer despite being highly inconsistent. We conclude that there is substantial scope for improving the consistency of structural alignment methods.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3338010  PMID: 22399676
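The consistency argument in the abstract above rests on homology being an equivalence relation: if residue a in protein A is aligned to b in B, and b to c in C, then a direct A-to-C alignment should also pair a with c. The sketch below illustrates one simple way to score this for a triplet of pairwise alignments. It is an assumption-laden illustration, not the measure used in the paper: the `Alignment` representation (a dictionary of residue-index pairs) and the function name `transitive_consistency` are hypothetical.

```python
from typing import Dict, Optional

# An alignment maps a residue index in the first structure
# to the aligned residue index in the second; gaps are simply absent.
Alignment = Dict[int, int]


def transitive_consistency(ab: Alignment, bc: Alignment, ac: Alignment) -> Optional[float]:
    """Fraction of A residues whose partner in C agrees via the A->B->C path.

    Only residues aligned in all three pairwise alignments are counted.
    Returns None when no residue can be checked.
    """
    checked = 0
    agree = 0
    for a, c_direct in ac.items():
        b = ab.get(a)
        if b is None:
            continue  # a is gapped in the A-B alignment
        c_via_b = bc.get(b)
        if c_via_b is None:
            continue  # b is gapped in the B-C alignment
        checked += 1
        if c_via_b == c_direct:
            agree += 1
    return agree / checked if checked else None


if __name__ == "__main__":
    ab = {1: 1, 2: 2, 3: 3}
    bc = {1: 1, 2: 3, 3: 4}
    ac = {1: 1, 2: 2, 3: 4}
    print(transitive_consistency(ab, bc, ac))
```

Averaging such a score over many triplets of domains would give a per-method consistency estimate; how gapped residues are counted is a design choice that changes the result.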
23.  CE-MC: a multiple protein structure alignment server 
Nucleic Acids Research  2004;32(Web Server issue):W100-W103.
The CE-MC server provides a web-based facility for the alignment of multiple protein structures based on C-α coordinate distances, using combinatorial extension (CE) and Monte Carlo (MC) optimization methods. Alignments are possible for user-selected PDB (Protein Data Bank) chains as well as for user-uploaded structures, or a combination of the two. The whole process of generating multiple structure alignments involves three distinct steps: all-to-all pairwise alignment using the CE algorithm, iterative global optimization of a multiple alignment using the MC algorithm, and formatting of the MC results using the JOY program. The server can be used to obtain multiple alignments for up to 25 protein structural chains, with the flexibility of uploading multiple coordinate files and performing multiple structure alignment for user-selected PDB chains. For large-scale jobs and local installation of the CE-MC program, users can download the source code and precompiled binaries from the web server.
PMCID: PMC441602  PMID: 15215359
24.  The ASTRAL compendium for protein structure and sequence analysis 
Nucleic Acids Research  2000;28(1):254-256.
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. The SPACI scores included in the system summarize the overall characteristics of a protein structure. A structural alignments database indicates residue equivalencies in superimposed protein domain structures. The PDB sequence-map files provide a linkage between the amino acid sequence of the molecule studied (SEQRES records in a database entry) and the sequence of the atoms experimentally observed in the structure (ATOM records). These maps are combined with information in the SCOP database to provide sequences of protein domains. Selected subsets of the domain database, with varying degrees of similarity measured in several different ways, are also available. ASTRAL may be accessed online.
PMCID: PMC102434  PMID: 10592239
25.  BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences 
Nucleic Acids Research  2011;39(Web Server issue):W197-W202.
We introduce BAR-PLUS (BAR+), a web server for functional and structural annotation of protein sequences. BAR+ is based on a large-scale genome cross comparison and a non-hierarchical clustering procedure characterized by a metric that ensures a reliable transfer of features within clusters. In this version, the method takes advantage of a large-scale pairwise sequence comparison of 13 495 736 protein chains, including 988 complete proteomes. Available sequence annotation is derived from UniProtKB, GO, Pfam and PDB. When PDB templates are present within a cluster (with or without their SCOP classification), profile Hidden Markov Models (HMMs) are computed on the basis of sequence-to-structure alignment and are cluster-associated (Cluster-HMM). From these, a library of 10 858 HMMs is made available for aligning even distantly related sequences for structural modelling. The server also provides pairwise query sequence–structural target alignments computed from the corresponding Cluster-HMM. BAR+ in its present version allows three main categories of annotation: PDB [with or without SCOP (*)] and GO and/or Pfam; PDB (*) without GO and/or Pfam; GO and/or Pfam without PDB (*) and no annotation. Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant. BAR+ is available online.
PMCID: PMC3125743  PMID: 21622657
