Search tips
Search criteria 


Logo of narLink to Publisher's site
Nucleic Acids Res. 2013 January; 41(1): e26.
Published online 2012 October 11. doi:  10.1093/nar/gks919
PMCID: PMC3592395

Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques


Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics to perform other outstanding tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignments become increasingly difficult when dealing with low similarity sequences. As widely known, these algorithms directly depend on specific features of the sequences, causing relevant influence on the alignment accuracy. Many MSA tools have been recently designed but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most used algorithms presented in the bibliography and their dependences on several features. A novel intelligent algorithm based on least square support vector machine is then developed to predict how accurate each alignment could be, depending on its analyzed features. This algorithm is performed with a dataset of 2180 MSAs. The proposed system first estimates the accuracy of possible alignments. The most promising methodologies are then selected in order to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased.


Multiple sequence alignment (MSA) is a widely used approach in the current molecular biology. This technique involves the comparison of new sequences with well-known ones, extracting their shared information and their significant differences (1). MSA methods have traditionally been essential for analyzing biological sequences and designing applications in structural modeling, functional prediction, phylogenetic analysis and sequence database searching (2). Current MSA tools are also applied to comparisons of protein structures (3), reconstructions of phylogenetic trees (4) or predictions of mutations (5) and interactions (6).

More recently, the interest of MSA methodologies has even increased due to new experimental techniques. Current technologies provide a large amount of data that must be analyzed, processed and assessed. Consequently, new computational strategies were required to extract biological meanings from such information. Thus, supervised learning algorithms have been widely implemented in the analysis of genomic and proteomic experimental data. Additionally, recent experimental methods also retrieve further biological data, which is useful for extending the information included within alignment methods. Thus, current MSAs tools take advantage of heterogeneous features, which are provided by recent biological progress in functional, structural and genomic researches, to obtain more accurate alignments within a reasonable time (7). Therefore, MSAs are becoming one of the more powerful and essential procedures of analysis (8).

Traditionally, alignment strategies are mainly incorporated in progressive algorithm and consistency-based methods (7). Progressive algorithms assemble previously built pairwise alignments through a clustering method and store their evaluations in a library. Some well-known programs using progressive strategies are ClustalW (9) or Muscle (10). On the other hand, consistency-based methodologies, e.g. T-Coffee (11) or MSAProbs (12), develop consistency scoring schemes, taking into consideration not only the previous pairwise alignments but also if these alignments are consistent with the final result. However, neither progressive nor consistency-based methods build optimal alignments when sequences are distantly related. More recent approaches, such as 3D-COFFEE (13) or Promals (14), include further information (structure, domains or homologies) in addition to the provided sequences. Such features are usually found by experimental resources in databases such as Protein Data Bank (PDB) (15), Uniprot (16) or Pfam (17). Nevertheless, the consumed time is still excessive for these strategies and improvements are only relevant when sequences are evolutionarily less related (7).

Therefore, currently there are many alignment methodologies based on different strategies. Moreover, each MSA tool usually depends on particular features; thereby, there is no consensus about which one produces more accurate alignments (18,19). A new intelligent algorithm based on least square support vector machine (LS-SVM) is proposed here in order to predict how accurately each MSA tool will align a set of sequences. Interesting features related to the sequences and their products have been added from several resources in order to make this prediction. To the best of our knowledge, there are no similar studies in the current bibliography, which address the prediction of the alignment accuracy. This algorithm also estimates which methodologies are more significant to align those sequences. The system has been created from 218 sets of sequences provided by the BAliBASE benchmark (20) and their corresponding features. Since our algorithm applies a priori features to predict the accuracy, only the best method is run. Consequently, the CPU cost is not excessively increased.


In this work, a novel system called ‘Prediction of Accuracy in Alignments based on Computational Intelligence’ (PAcAlCI) is developed. PAcAlCI is composed by four independent modules (Figure 1). First, 218 groups of sequences are aligned through 10 different methodologies, producing a dataset of 2180 alignments (‘Input Dataset’ module). Alignments are then evaluated in order to measure their accuracies. From these groups of sequences, several features are also retrieved from various relevant databases (‘Feature Extraction’ module). The most useful features are progressively included in a subset which is used by the subsequent algorithm (‘Feature Selection’ module). Finally, selected features are added to an LS-SVM model to predict alignment accuracies and, subsequently, the most suitable methodologies (‘LS-SVM Prediction’ module). The PAcAlCI system was completely implemented with Matlab (Version R2010b). The source code is available at

Figure 1.
PAcAlCI scheme. The architecture is developed into four modules: input dataset, feature extraction, feature selection and LS-SVM prediction.

Input dataset

A set of sequences must be considered in order to compare different alignment algorithms and develop the proposed prediction. Several datasets and techniques have usually been developed to standardize the comparison of alignment results, e.g. Oxbench (21), HOMSTRAD (22), Prefab (10) or BAliBASE (20). In this work, the BAliBASE benchmark (v3.0) was chosen.

BAliBASE defines several groups of sequences that can be easily aligned by standard algorithms. This dataset includes a total of 218 sets of sequences that were manually extracted from different databases, specifically the PDB (15). This benchmark also provides a set of handmade reference alignments (gold standard) in order to compare them with the alignments obtained by other tools. Thus, BAliBASE calculates a Sum-of-Pairs (SP) score to evaluate such alignments. These SP scores are used by our system to measure the quality of each alignment.

Sequences in BAliBASE are classified in the next subsets: (i) equidistant PDB sequences with An external file that holds a picture, illustration, etc.
Object name is gks919i1.jpg35 insertions and sharing An external file that holds a picture, illustration, etc.
Object name is gks919i2.jpg20% of identity between any pair of sequences (RV11 and RV12 subsets); (ii) PDB orphan sequences of families with An external file that holds a picture, illustration, etc.
Object name is gks919i3.jpg40% of identity and at least one known 3D structure (RV20 subset); (iii) subfamilies of sequences that share An external file that holds a picture, illustration, etc.
Object name is gks919i4.jpg40% of identity but An external file that holds a picture, illustration, etc.
Object name is gks919i5.jpg20% with other subfamilies (RV30 subset); (iv) sequences with An external file that holds a picture, illustration, etc.
Object name is gks919i6.jpg20% of identity and large terminal extensions (RV40 subset) and (v) sequences with An external file that holds a picture, illustration, etc.
Object name is gks919i7.jpg20% of identity and internal insertions (RV50 subset). From these subsets, it will be possible to establish, for example, which methodology is better when less related sequences are aligned or what differences are found when the methodologies include additional information. These questions will be solved in the ‘Comparison of MSA Methodologies’ section.

MSA methodologies

Ten of the most relevant MSA tools are selected to be included in PAcAlCI. These tools are classified according to their implemented strategy: progressive techniques, consistency-based methods or algorithms including additional information (see summary in Table 1). Programs were run with their default features. Among the progressive methods, ClustalW (9), Muscle (10), Kalign (23), Mafft (24) and RetAlign (25) were chosen. ClustalW designs a tree-computing algorithm to find the alignment by means of distance scores and a gap weighting scheme. Muscle develops a strategy based on three stages, where a quickly built alignment is refined with an iterative method and a tree-dependent partitioning approach. Kalign uses the Wu–Manber string-matching algorithm (26) to improve the distance calculation of the classical progressive approach. Mafft identifies common homologies in sequences through a fast Fourier transform, significantly reducing the computational cost. Lastly, RetAlign implements a progressive corner-cutting algorithm to identify optimal alignments in a network of possible alignments.

Table 1.
Summary of applied methodologies

Other three consistency-based approaches were included in PAcAlCI: T-Coffee (11), ProbCons (27) and Fast Statistical Alignment (FSA) (28). T-Coffee develops a standard consistency algorithm, building pairwise alignments and evaluating them against third sequences. ProbCons defines a probabilistic consistency based on a pair of hidden Markov models (pair-HMMs) to perform a novel scoring scheme for the standard consistency library. FSA estimates the insertion and deletion processes in sequences through pair-HMMs to combine their probabilities into alignments.

Finally, two more complex methodologies, namely 3D-Coffee (13) and Promals (14) were also applied. 3D-Coffee introduces structural information in the standard T-Coffee evaluations from the PDB (15), performing comparisons between each two structures and each sequence with its structure. On the other hand, Promals adds information based on homologies, combining sequences and homologies in profiles through HMMs.

Databases and feature extraction

Features of BAliBASE sequences are extracted from well-known biological databases. Such databases are consulted to obtain interesting data which complement the sequences and to build a complete set of features. The final dataset will be composed by 23 features (see summary in Table 2).

Table 2.
Summary of features extracted from several databases

Some features related to sequences, domains, amino acid types or structures have already been successfully included in other similar knowledge-based systems (18,29). However, the set of features is complemented with further information based on other studies such as protein interaction prediction (30) or protein model classification (31). Therefore, a more complete feature environment is presented in this work in order to study its relevance to sequence alignments. Here below, each consulted database is described, indicating which features have been retrieved and their nomenclature in the feature list:

  • BAliBASE (20) can be considered the first consulted database, as it provides the sequences that are aligned. Then, the features associated with each set of sequences are the number of sequences (An external file that holds a picture, illustration, etc.
Object name is gks919i33.jpg), the average length of sequences (An external file that holds a picture, illustration, etc.
Object name is gks919i34.jpg) and the normalized variance of the sequence length (An external file that holds a picture, illustration, etc.
Object name is gks919i35.jpg). This information determines whether there is any dependence between alignment tools and the number/length of sequences. Since BAliBASE classifies sequences according to certain features (see the ‘Input Dataset’ section for details), the subset, where each set of sequences is included, is proposed as another feature (An external file that holds a picture, illustration, etc.
Object name is gks919i36.jpg).
  • Uniprot (16) consists of a wide repository of proteins with accurate, consistent and rich annotation. Several features are calculated from this database: the percentage of amino acids in An external file that holds a picture, illustration, etc.
Object name is gks919i37.jpg-helix structures (An external file that holds a picture, illustration, etc.
Object name is gks919i38.jpg), the percentage of amino acids in An external file that holds a picture, illustration, etc.
Object name is gks919i39.jpg-strand structures (An external file that holds a picture, illustration, etc.
Object name is gks919i40.jpg) and the percentage of amino acids in the transmembrane region (An external file that holds a picture, illustration, etc.
Object name is gks919i41.jpg). Data associated with similar secondary structures or locations usually indicate more related sequences or regions.
  • Pfam (17) identifies common functional regions in families, also called domains. Domain features are performed from this database as: average number of domains per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i42.jpg) and average number of shared domains (between each pair of sequences) per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i43.jpg). Domain similarities usually imply regions with related functionality. This functionality can be useful to understand how some sequences must be efficiently aligned or how close sequences are in their families.
  • The Gene Ontology Annotation (32) provides controlled vocabularies for the annotations of molecular attributes in different model organisms. These vocabularies are classified into three structured ontologies organized as a directed acyclic graph (DAG): molecular function (MF), cellular component (CC) and biological process (BP). Features used in this work from Gene Ontology Annotation (GOA) are average number of annotated terms per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i44.jpg); average number of annotated terms for each ontology per sequence: MF (An external file that holds a picture, illustration, etc.
Object name is gks919i45.jpg), CC (An external file that holds a picture, illustration, etc.
Object name is gks919i46.jpg) and BP (An external file that holds a picture, illustration, etc.
Object name is gks919i47.jpg) and average number of shared GO terms (between each pair of sequences) per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i48.jpg).
  • PDB (15) includes information about experimentally determined 3D structures of each protein. The average number of annotated PDB structures per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i49.jpg), the percentage of sequences with structures (An external file that holds a picture, illustration, etc.
Object name is gks919i50.jpg) and the average number of shared structures (between each pair of sequences) per sequence (An external file that holds a picture, illustration, etc.
Object name is gks919i51.jpg) are proposed from this database.

Apart from these databases, other resources have been applied in order to complete the set of features. For instance, the classification of amino acids included in (33) has been applied to define: the percentage of polar uncharged amino acids [G,A,P,V,L,I,M] (An external file that holds a picture, illustration, etc.
Object name is gks919i52.jpg), the percentage of non-polar aliphatic amino acids [S,T,C,N,Q] (An external file that holds a picture, illustration, etc.
Object name is gks919i53.jpg), the percentage of basic positively charged amino acids [K,R,H] (An external file that holds a picture, illustration, etc.
Object name is gks919i54.jpg), the percentage of aromatic amino acids [F,W,Y] (An external file that holds a picture, illustration, etc.
Object name is gks919i55.jpg) and the percentage of negatively charged amino acids [D,E] (An external file that holds a picture, illustration, etc.
Object name is gks919i56.jpg).

Finally, the MSA method being executed (see the ‘MSA methodologies’ section) is the last included feature (An external file that holds a picture, illustration, etc.
Object name is gks919i57.jpg). This feature is determinant in the proposed system, as the purpose is to predict the accuracy of each method according to all these features. Besides, the most suitable methods according to the predicted accuracies are selected from that feature. Also, the accuracies of each method are included as outputs. As explained before, this accuracy is called SP score by BAliBASE and it is defined as a similarity value against the gold-standard references.

Feature selection based on mutual information

The relevance of the previously proposed features is analyzed through a feature selection procedure. Feature selection algorithms allow reducing the number of features, filtering out those irrelevant or redundant. One of the well-known feature selection, called minimal-redundancy-maximal-relevance (mRMR) (34), is applied in this work. First, this approach calculates the relevance of the features by using their mutual information. The obtained relevance is then assessed through the subsequent machine-learning procedure. The aim of mRMR is to select a feature at a time with a first-order incremental search, trying to avoid redundant features (see Supplementary mRMR feature selection for details). Discrete and continuous random variables are both considered in the mRMR algorithm. Such property is essential in the proposed set of features, since both types of variables were included (‘real’ and ‘integer’ types in Table 2). Besides, the output accuracy is also defined as a continuous variable.

The mRMR method achieves a great accuracy in a reduced time. Thus, the algorithm is useful to accurately select features among a huge number of them. The proposed features are then ranked from mRMR. Subsequently, the LS-SVM model is trained and evaluated progressively increasing the number of features from this ranking.

Least squares support vector machine

Features selected before are included in an LS-SVM model (35) in order to estimate different alignment accuracies. Subsequently, the algorithm also determines which tools are more likely to obtain the best alignment in term of accuracy. LS-SVMs models were generally designed for prediction approaches, but they present the most effective performance for regression problems. Since our system includes continuous values of accuracy, the proposed prediction is defined as a regression problem. Additionally, as the order of input data in LS-SVM is arbitrary, any change of the order would not affect the modeling result (35,36). Therefore, the application of LS-SVM would be an effective and faithful solution.

A kernel model must also be selected to correctly design the prediction method based on LS-SVM. The radial basis function (RBF) kernel was chosen to be applied in the proposed methodology. Additionally, LS-SVMs based on RBF kernels must also be performed by two kinds of hyper-parameters: the regulation parameter and the kernel parameters. These hyper-parameters were optimized by cross-validation in the proposed LS-SVM system (details about kernels and hyper-parameters in LS-SVMs are provided in the Supplementary LS-SVM models). The LS-SVM algorithm is developed here from the Matlab toolbox found in (37).

In order to assess the LS-SVM model, a 10-fold cross-validation procedure is performed. This procedure randomly divides the complete dataset (2180 problems) into 10 subsets of 218 problems. Nine subsets are then applied to train the proposed system. The training procedure includes the most relevant features and the posterior accuracy for each problem in the subset. Thus, hyper-parameters are tuned and the LS-SVM model is estimated. Subsequently, the last subset is used to test the estimated LS-SVM model. The accuracies from such subset are then predicted and compared with those already known. The training and test procedures are repeated 10 times with the 10 different subsets. The predicted accuracies are then validated by their errors against real ones. The prediction error is measured by means of the ‘mean relative error’ (MRE). Taking into account this error value, a confidence interval is proposed to select the most suitable methodologies. For a specific set of sequences, those methodologies whose accuracies exceed a confidence value (An external file that holds a picture, illustration, etc.
Object name is gks919i58.jpg) are selected (see the MRE and An external file that holds a picture, illustration, etc.
Object name is gks919i59.jpg equations in the Supplementary LS-SVM validation).


Comparison of MSA methodologies

As described before, each MSA method proposes different solutions depending on certain conditions or features. For this reason, biologists and researchers do not agree with a generally accepted solution (19). Some methods have been developed to unify criteria and choose the most suitable alignment tool (21,22), but this is currently an open issue.

In order to understand the performance of MSAs, accuracies from several methodologies can be compared. Previous reviews (7,8) have already compared accuracies from BAliBASE subsets (SP scores) against the applied strategy (progressive, consistency-based or approaches with additional data). Generally, SP scores are quite similar independently of the methodologies. Only when more distant sequences are provided (<20% of identity), accuracies are significantly higher in methods including additional data. However, these strategies including additional data are clearly in disadvantage in terms of required time (7). Thus, we could suggest that, only in special cases with less related sequences, additional data are clearly useful.

This analysis supports the idea of using a system to previously decide which methodologies are most promising to obtain better alignments. Here, PAcAlCI predicts accuracies to decide whether differences are enough to select more sophisticated methods against faster ones. Therefore, this system not only predicts the most relevant methodologies, but it also estimates differences between alignment performances. We could then decide which method constructs an accurate enough alignment according to its predicted accuracy.

Selection of feature subset

The complete dataset was composed by 2180 different inputs. Such inputs were retrieved from the 218 groups of sequences of BAliBASE. They were then aligned by the 10 previously proposed algorithms. For each input alignment, a set of 23 features was also retrieved. Output values were represented by the 2180 accuracies calculated from the input alignments.

As described above, the mRMR algorithm (34) was applied to select significant features. That procedure returned a ranking of features according to their relevance against calculated accuracies. An increasingly higher subset of features was then included in the subsequent system. According to this ranking (Table 2), the most relevant features were ‘the number of domains’ (An external file that holds a picture, illustration, etc.
Object name is gks919i60.jpg) and ‘the applied methodology’ (An external file that holds a picture, illustration, etc.
Object name is gks919i61.jpg). Regarding the first one, domains can be considered a measure of how deeply sequences are known. Domains are also associated with functional relationships and they involve more conserved sequence sections. Then, sequences that include more number of domains will be harder to align and, subsequently, the system could provide accurate predictions. On the other hand, the second feature is an essential variable because it is including obligatory information. This feature must always be included in order to know for which methodology the prediction is done, developing a robust and coherent system of prediction.

The features related to sequences, ‘the number of sequences’ (An external file that holds a picture, illustration, etc.
Object name is gks919i62.jpg) and ‘the average/variance of the length’ (An external file that holds a picture, illustration, etc.
Object name is gks919i63.jpg), were also ranked among first positions in the ranking. These features are highlighted because the availability to obtain accurate alignments directly depends on sequence properties. Other features less related to sequences but including amino acid information were found in the first half of the ranking. Features such as ‘types of amino acids’ (An external file that holds a picture, illustration, etc.
Object name is gks919i64.jpg) or ‘the secondary structure’ (An external file that holds a picture, illustration, etc.
Object name is gks919i65.jpg) provide complementary information about the composition and formation of sequences. Thus, they can also be helpful to efficiently predict some similarities.

Additionally, it is also important to analyze the occurrences in BAliBASE of the selected features in order to understand the obtained feature selection. BAliBASE sequences usually have known secondary structures (An external file that holds a picture, illustration, etc.
Object name is gks919i66.jpg-helix or An external file that holds a picture, illustration, etc.
Object name is gks919i67.jpg-strand) or GO terms. However, PAcAlCI was also trained with a few datasets from BAliBASE where these features are not known; thereby, cases without this information were also considered. Thus, new datasets from users not including that information can also be accurately estimated, returning their predicted accuracies and a set of suitable methods to use. On the other hand, datasets associated with other features (e.g. transmembrane regions) are considerably less included in BAliBASE. Consequently, the significance obtained for such feature was considered irrelevant and it was discarded from the selection procedure (for example, the ‘transmembrane amino acids’ feature was ranked in the 22nd position).

Prediction of alignment accuracy

Features previously analyzed were added to the subsequent LS-SVM model. PAcAlCI then predicted the accuracy which each methodology returned for a set of sequences. As far as we are concerned, similar accuracy predictions in MSAs have not been addressed before. PAcAlCI was performed using an incremental combination of features in ascendant relevance order. Such combination was applied adding a feature at a time according to the previous ranking. Finally, a 10-fold cross-validation was performed to assess the algorithm. The prediction error (MRE) was calculated for the training and test sets.

The evolution of the errors for each combination of features is shown in Figure 2. According to such evolution, the error progressively decreases with higher number of features. However, an almost optimal value is reached from around 10 features. The prediction error is then kept around 6% for the training data and 9% for the test data. So, we could suggest that all features are not necessary to obtain the optimal prediction. A smaller number of features was then used to perform the system without lack of accuracy. Specifically, the 10 most relevant features were added to the LS-SVM model. According to this configuration, accuracies predicted from four sets of sequences are shown in Table 3. The total MRE value returned by PAcAlCI was 0.0587 for the training set and 0.1012 for the test. This error is distributed along the 2180 predicted accuracies as shown in Figure 3.

Figure 2.
Evolution of the MRE. The number of features progressively increases in ascendant relevance order. Training and test errors are shown.
Figure 3.
Distribution of relative errors for training and test sets. The corresponding LS-SVM prediction was performed using 10 features.
Table 3.
Accuracies obtained for four different sets of sequences

Analyzing more deeply the proposed system, higher error values are less frequent and they are usually associated with low accuracies (see detail in Figure 3). Alignments with low accuracies are less meaningful in our system, as their performances are totally unacceptable. Consequently, they could not even be considered in PAcAlCI. A minimal accuracy value, called An external file that holds a picture, illustration, etc.
Object name is gks919i68.jpg, was then defined as a threshold. Thus, the LS-SVM model only kept the most accurate alignments, filtering out the remaining ones. This threshold allowed improving the subsequent prediction. For example, if An external file that holds a picture, illustration, etc.
Object name is gks919i69.jpg, the MRE value using 10 input features decreased to 0.0340 in the training set and to 0.0608 in the test. As appreciated, error values were reduced by >2% and >4% for the training and test set, respectively. These errors improved because low values of accuracy, which led to highly wrong predictions, were previously filtered (see the new distribution of errors in Figure 4). Prediction errors are now considered low enough to adequately determine differences between methodologies. Then, the most suitable alignments can be selected according to their predicted accuracies.

Figure 4.
Distribution of relative errors for training and test sets. Low accuracies were previously filtered to improve the LS-SVM prediction, avoiding prediction with high errors (An external file that holds a picture, illustration, etc.
Object name is gks919i70.jpg).

Selection of alignment methods

In occasions, several alignment methodologies obtain quite similar accuracies; thereby, no one significantly stands out from the rest. Consequently, as in other few researches (29,38), the most promising MSA tools can be selected according to several features. In this case, a confidence interval was defined to decide those methodologies which acceptably align a set of sequences. The confidence interval covers those accuracy values that are higher than a confidence value An external file that holds a picture, illustration, etc.
Object name is gks919i71.jpg (see its formal definition in Supplementary LS-SVM validation). Those methodologies whose accuracies exceed such confidence value were chosen as candidate methods.

This confidence interval was applied to real accuracies and predicted accuracies. Two sets of suitable methodologies were then retrieved (real and predicted sets). Thus, the number of selected methodologies was variable for each group, as it depends on how similar accuracies were in that specific problem. Both groups were then compared in order to know how many methodologies were correctly selected in the predicted set (see four examples in Figure 5). For example, using accuracies from the performance of 10 features without An external file that holds a picture, illustration, etc.
Object name is gks919i72.jpg threshold, the 83.55% of predicted methodologies were also included in the real group. When accuracies were predicted with the limitation An external file that holds a picture, illustration, etc.
Object name is gks919i73.jpg, the percentage of successfully selected methodologies increased to 85.89%. Therefore, the proposed system usually performed an accurate group of outstanding methodologies.

Figure 5.
Intersection of suitable and predicted methodologies (Venn diagrams) corresponding to the four alignments whose accuracies are shown in Table 3.

As shown in the examples of Figure 5, methodologies including additional information, namely 3D-Coffee and Promals, were selected for sequences with low similarity (RV11 subset). In these cases, more commonly used aligners (ClustalW, Kalign or Muscle) were not selected, as they did not build accurate enough alignments. However, these more complex methods (3D-Coffee and Promals) did not significantly outperform other faster methods when sequences were more related. Thus, methodologies as Mafft, T-Coffee, Kalign or ProbCons were also selected when sequences were highly related (>20% of similarity). Consequently, we could again suggest that the prediction system is working as expected. For instance, those datasets including more than two domains per sequence selected Kalign as a suitable method (in the 80.95% of cases), whereas Mafft was appropriate for datasets with less domains (78% of datasets selected it). Regarding the size of the dataset, large datasets (>50 sequences or >400 amino acids of average length) usually picked both Mafft and Kalign (90.17 and 71.9%, respectively), while ProbCons was chosen for shorter datasets (62.89%). Finally, Kalign also suited when the sequence lengths in datasets have a high variability (a difference of >100 amino acids in average between sequence lengths) and ProbCons for low variability (69.05% and 65.89% of cases, respectively).

Although there are other expert systems to select adequate MSA tools (38,39), PAcAlCI was compared with AlexSys (29), as it performs a more similar strategy (see comparison in Table 4). However, comparing both methodologies can be complicated. Both systems develop similar machine-learning approaches, but their objectives are quite different. AlexSys defines a decision-tree approach to predict whether sequences are ‘strongly’ or ‘weakly’ aligned with each specific method (classification problem and binary solution). The best method among those classified as ‘strong’ is then inferred according to their success probability or their required CPU time. This binary classification can be quite subjective in some cases. Since accuracies over 0.5 are already classified as ‘strong’, quite different accuracies, e.g. 0.5 and 0.9, are considered identical in the AlexSys approach. In a different way, PAcAlCI first predicts accuracy values (regression problem and real solution). The accuracy prediction provides a relevant improvement in order to decide whether it is worth aligning with a specific methodology. Besides, suitable methods are also selected according to the best accuracies. In general, AlexSys correctly predicts the best aligner in a 45% of its test alignments. In another 45.5% of the alignments, the best aligner corresponded to the second predicted method. In general, global success rates in PAcAlCI are quite similar (83.55% or 85.89% depending on the An external file that holds a picture, illustration, etc.
Object name is gks919i76.jpg threshold), although the number of suitable methods is usually higher in our prediction. Regarding the included tools, PAcAlCI is composed by a wider group of previous methodologies (10 approaches compared with the six of AlexSys), including more complex ones as 3D-Coffee or Promals.

Table 4.
Comparison between PAcAlCI and AlexSys

Despite these differences, both methods may be considered complementary, as both perform accurate classifiers but in different contexts. In any case, the final decision of selecting the most suitable methodology among the proposed ones can rely on the final user of this system. Other criteria such as the complexity of parameters in the methodologies or the required time could also be taken into account in order to choose the correct tool among the selected ones.


MSA is currently an open issue. Alignment tools must be continually improved, as they are essential in the analysis of huge amount of data provided by next-generation sequencing and high-throughput experiments. Thus, new trends in MSAs aim to integrate the major amount of information while trying to significantly reduce the used time. For this reason, efficient computational techniques are increasingly implemented.

In this work, a complete study of MSA methodologies has been developed. Relevant methodologies in this field were first compared. Several types of methods were discussed and we have suggested that only in special cases more sophisticated approaches including additional information are really necessary. A novel intelligent system (PAcAlCI) was then proposed based on the knowledge acquired from this study.

PAcAlCI was designed in order to predict the accuracy that each alignment method reaches for a specific set of sequences. This information gives us an idea of how accurately each methodology works. The mRMR feature selection technique was first applied to 23 features previously retrieved from several biological databases. We have also described how the system can be performed with only the 10 most relevant features to predict accuracies with a reasonable efficiency. Finally, we have proposed the outstanding methodologies which can be used for certain sequences according to their predicted accuracies. In this sense, the proposed algorithm is able to successfully select the most outstanding methods according to the previously predicted accuracies.


Supplementary Data are available at NAR Online: Supplementary mRMR feature selection, Supplementary LS-SVM models, Supplementary LS-SVM validation and Supplementary References [40–47].


The Spanish CICYT [Project SAF2010-20558]; the Government of Andalusia [Project P09-TIC-175476]. Funding for open access charge: the Government of Andalusia [Project P09-TIC-175476].

Conflict of interest statement. None declared.

Supplementary Material

Supplementary Data:


1. Attwood TK, Parry-Smith DJ. Introduction to Bioinformatics. Prentice Hall: Pearson Education; 2002.
2. Pei J. Multiple protein sequence alignment. Curr. Opin. Struct. Biol. 2008;18:382–386. [PubMed]
3. Gelly JC, Joseph AP, Srinivasan N, de Brevern AG. iPBA: a tool for protein structure comparison using sequence alignment strategies. Nucleic Acids Res. 2011;39:W18–W23. [PMC free article] [PubMed]
4. Wang LS, Leebens-Mack J, Wall PK, Beckmann K, dePamphilis CW, Warnow T. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE–ACM Trans. Comput. Biol. Bioinform. 2011;8:1108–1119. [PubMed]
5. Hicks S, Wheeler DA, Plon SE, Kimmel M. Prediction of missense mutation functionality depends on both the algorithm and sequence alignment employed. Hum. Mutat. 2011;32:661–668. [PubMed]
6. Li AX, Marz M, Qin J, Reidys CM. RNA–RNA interaction prediction based on multiple sequence alignments. Bioinformatics. 2011;27:456–463. [PubMed]
7. Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009;25:2455–2465. [PMC free article] [PubMed]
8. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 2010;11:473–483. [PMC free article] [PubMed]
9. Thompson JD, Higgins DG, Gibson TJ. ClustalW: improving the sensitivity of progressive multiple sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. [PMC free article] [PubMed]
10. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. [PMC free article] [PubMed]
11. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol Biol. 2000;302:205–217. [PubMed]
12. Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010;26:1958–1964. [PubMed]
13. O’Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 2004;340:385–395. [PubMed]
14. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics. 2007;23:802–808. [PubMed]
15. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. [PMC free article] [PubMed]
16. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang HZ, Lopez R, Magrane M, et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004;32:D115–D119. [PMC free article] [PubMed]
17. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222. [PMC free article] [PubMed]
18. Nuin PAS, Wang ZZ, Elisabeth RM. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics. 2006;7:471. [PMC free article] [PubMed]
19. Sierk ML, Smoot ME, Bass EJ, Pearson WR. Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC Bioinformatics. 2010;11:146. [PMC free article] [PubMed]
20. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005;61:127–136. [PubMed]
21. Raghava GPS, Searle SMJ, Audley PC, Barber JD, Barton GJ. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics. 2003;4:47. [PMC free article] [PubMed]
22. Stebbings LA, Mizuguchi K. HOMSTRAD: recent developments of the homologous protein structure alignment database. Nucleic Acids Res. 2004;32:D203–D207. [PMC free article] [PubMed]
23. Lassmann T, Sonnhammer EL. Kalign—an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 2005;6:298. [PMC free article] [PubMed]
24. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. [PMC free article] [PubMed]
25. Szabo A, Novak A, Miklos I, Hein J. Reticular alignment: a progressive corner-cutting method for multiple sequence alignment. BMC Bioinformatics. 2010;11:570. [PMC free article] [PubMed]
26. Wu S, Manber U. Fast text searching allowing errors. Commun. ACM. 1992;35:83–91.
27. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340. [PubMed]
28. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L. Fast statistical alignment. PLoS Comput. Biol. 2009;5:5. [PMC free article] [PubMed]
29. Aniba MR, Poch O, Marchler-Bauer A, Thompson JD. AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis. Nucleic Acids Res. 2010;38:6338–6349. [PMC free article] [PubMed]
30. Wu XM, Zhu L, Guo J, Zhang DY, Lin K. Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Res. 2006;34:2137–2150. [PMC free article] [PubMed]
31. Roslan R, Othman RM, Shah ZA, Kasim S, Asmuni H, Taliba J, Hassan R, Zakaria Z. Utilizing shared interacting domain patterns and Gene Ontology information to improve protein–protein interaction prediction. Comput. Biol. Med. 2010;40:555–564. [PubMed]
32. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004;32:D262–D266. [PMC free article] [PubMed]
33. Mathews CK, Holde KEV, Ahern KG. Biochemistry. Redwood City, CA: Benjamin Cummings Publishing; 2000.
34. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005;27:1226–1238. [PubMed]
35. Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J. Least Squares Support Vector Machines. Singapore: World Scientific Pub. Co. Inc.; 2003.
36. Li L, Su H, Chu J. Sparse representation based on projection method in online least squares support vector machines. J. Control Theory Appl. 2009;7:163–168.
37. De Brabanter K, Karsmakers P, Ojeda F, Alzate C, De Brabanter J, Pelckmans K, De Moor B, Vandewalle J, Suykens JAK. 2011 LS-SVMLab: a MATLAB toolbox for least squares support vector machines (v1.8). (21 September 2012, date last accessed)
38. Anderson CL, Strope CL, Moriyama EN. SuiteMSA: visual tools for multiple sequence alignment comparison and molecular sequence simulation. BMC Bioinformatics. 2011;12:184. [PMC free article] [PubMed]
39. Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O. MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics. 2006;7:318. [PMC free article] [PubMed]
40. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009;20:189–201. [PubMed]
41. John G, Kohavi R, Pfleger K. Irrelevant features and the subset selection problem. In International Conference on Machine Learning. 1994:121–129.
42. Bins J, Draper BA. Feature selection from huge feature sets. In 8th IEEE International Conference on Computer Vision. 2001;2:159–165.
43. Cover TM, Thomas JA. Elements of Information Theory. New York, NY, USA: Wiley-Interscience; 2006.
44. Kullback S. Information Theory and Statistics. New York, NY, USA: Courier Dover Publications; 1997.
45. Babich GA, Camps OI. Weighted Parzen windows for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 1996;18:567–570.
46. Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 1952;49:409–436.
47. Rossi F, Lendasse A, Francois D, Wertz V, Verleysen M. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemometrics Intell. Lab. Syst. 2006;80:215–226.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press