Due to the increasing speed and number of genome sequencing projects, the gap between the number of known structures and the number of known protein sequences keeps increasing. As a result, demand for reliable computational methods today is higher than ever, while in silico estimation of protein structures remains one of the most challenging tasks in bioinformatics.
The central assumption of comparative bioinformatics methods for proteins is that the structures of proteins are more conserved than their amino-acid sequences. This allows homology modelling, namely, mapping the structure of a sequence onto homologous sequences. As insertions and deletions separating two homologous sequences accumulate, homologous characters in the two sequences will occupy different positions, which causes a non-trivial problem of identifying homologous positions. This problem can be solved by sequence alignment algorithms [1
], which maximise the similarity between aligned positions while also minimise the insertions and deletions needed to align the sequences.
The relationship between gap-penalties and similarity scores can be set such that they maximise the number of correctly aligned positions in a benchmark set of alignments [5
]. By contrast, stochastic models are capable to calibrate their parameters by applying a Maximum Likelihood approach even if no benchmark set is available. Hidden Markov Models were the first such stochastic models which have appeared in bioinformatics more than ten years ago [7
]. Thorne, Kishino and Felsenstein introduced time-continuous Markov models for describing insertion and deletion events [8
], and they showed on simulated data that the Maximum Likelihood method could correctly estimate the insertion-deletion as well as the substitution parameters with which the simulated data had been generated. The TKF models have subsequently been improved [10
], and have been tested for alignment accuracy on biological data [11
]. Above automatic parameter estimation, the other main advantage of stochastic models is that such models can provide posterior probabilities for each estimated alignment column as well as for the whole alignment, and these posterior probabilities correlate with the probability for the alignment column being correctly aligned [11
The uncertainty in the sequence alignment can be slightly reduced when more than two sequences are simultaneously aligned together, and hence, much effort has been put in developing accurate multiple sequence alignment methods. Although efficient algorithms exist for any type of pairwise alignment problem, the multiple sequence alignment problem is hard. It has been proved that the optimal multiple sequence alignment problem under the sum-of-pairs scoring scheme is NP-hard [14
], and it is strongly believed that the statistical approach to multiple sequence alignment is algorithmically not simpler than score-based approaches. Since it is unlikely that fast algorithms exist for any type of exact multiple sequence alignment problem, heuristic approaches have become widespread. Profile-HMM methods [15
] align sequences to a profile-HMM instead of each other, and the multiple sequence alignment is obtained by aligning sequences together via a profile-HMM. Since the jumping and emission parameters of the HMM are learned from the data, this approach needs many sequences for parameter optimisation. Nevertheless, profile-HMMs do not consider evolutionary relationships amongst sequences, and hence, they cannot handle properly over-representation of evolutionary groups.
Iterative approaches have been introduced for score-based methods in the eighties [17
] and have recently been extended for stochastic methods [13
] using the transducer theory [20
]. The drawback of iterative approaches is that in each iteration, they consider only a single, locally optimal alignment that might not lead to a globally optimal alignment. Moreover, as they consider only locally optimal partial solutions, they naturally underestimate the uncertainty of posterior probabilities.
The Markov chain Monte Carlo (MCMC) method represents a third way to attack the multiple stochastic alignment problem. It was first introduced for assessing the Bayesian distribution of evolutionary parameters of the TKF91 model aligning two sequences [22
], and has subsequently been extended to multiple sequence alignment [23
]. The general theory of Markov chain Monte Carlo [29
] states that the Markov chain will be in the prescribed distribution after infinite number of random steps. Obviously, we cannot wait infinite many steps in practice, and therefore the success of MCMC methods depends on fast convergence: if the Markov chain converges quickly to the prescribed distribution, the bias of samples from the Markov chain after a limited number of steps will be negligible. The convergence can be checked by measuring autocorrelation in the log-likelihood trace or a few other statistics of the Markov chain and by running several parallel chains with different random starting points [31
Since the above mentioned methods for multiple stochastic sequence alignment problems have been introduced only recently, no large-scale, comprehensive analysis on the performance of methods for protein structure prediction has been published yet. In this paper, we present a survey on how stochastic alignment methods can be used for protein secondary structure predictions. The prediction can be based on pairwise or multiple alignments and in both cases, either only a single, optimal alignment or the whole posterior distribution of alignments is used for prediction. We are interested in the question how much one can gain by involving more sequences and the posterior distribution of the alignments into the secondary structure prediction.