Deep learning is one of the most powerful machine learning methods that has achieved the state-of-the-art performance in many domains. Since deep learning was introduced to the field of bioinformatics in 2012, it has achieved success in a number of areas such as protein residue-residue contact prediction, secondary structure prediction, and fold recognition. In this work, we developed deep learning methods to improve the prediction of torsion (dihedral) angles of proteins.
We designed four deep learning architectures to predict protein torsion angles: a deep neural network (DNN), a deep restricted Boltzmann machine (DRBM), a deep recurrent neural network (DRNN), and a deep recurrent restricted Boltzmann machine (DReRBM). The two recurrent architectures were included because protein torsion angle prediction is a sequence-related problem. In addition to existing protein features, two new features (predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments) are used as input to each of the four deep learning architectures to predict the phi and psi angles of the protein backbone. The mean absolute error (MAE) of phi and psi angles predicted by DRNN, DReRBM, DRBM and DNN is about 20–21° and 29–30°, respectively, on an independent dataset. The MAE of the phi angle is comparable to that of existing methods, but the MAE of the psi angle is 29°, 2° lower than that of existing methods. On the latest CASP12 targets, our methods also achieved performance better than or comparable to a state-of-the-art method.
Our experiment demonstrates that deep learning is a valuable method for predicting protein torsion angles. The deep recurrent network architecture performs slightly better than deep feed-forward architecture, and the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments are useful features for improving prediction accuracy.
The online version of this article (10.1186/s12859-017-1834-2) contains supplementary material, which is available to authorized users.
The conformation of the backbone of a protein can be largely represented by two torsion angles (phi and psi angles) associated with each Cα atom. A number of methods, mostly data-driven machine learning methods, have been developed to predict torsion angles from protein sequences; and the predictions can then be used as restraints to predict protein tertiary structures.
The first real-value psi angle prediction method based on machine learning, DESTRUCT, was proposed by Wood and Hirst in 2005. It used the protein sequence profile, in the form of position-specific scoring matrices (PSSM), as input to iterative neural networks to predict the psi angle. Real-SPINE2.0 was the first method to predict both phi and psi angles using a neural network. ANGLOR used neural networks to predict the phi angle and support vector machines to predict the psi angle separately.
Some recent methods enhanced or integrated standard machine learning methods such as neural networks and support vector machines (SVM) to improve torsion angle prediction. Real-SPINE3.0 used a guided-learning mechanism to train a two-layer neural network, reducing the mean absolute error to 22° for the phi angle and 36° for the psi angle. TANGLE used a two-level SVM-based regression approach to make predictions. SPINE X and SPINE XI combined discrete and continuous real-value prediction of torsion angles with multi-step neural network learning, which yielded a MAE of 35° and 33.4° for phi and psi angles, respectively. A comprehensive study has shown that SPINE X performs better than ANGLOR and TANGLE, especially on psi angle prediction.
In recent years, deep learning methods that overcome some limitations of traditional artificial neural networks have been successfully applied to predict local and non-local structural properties of proteins [9–12]. SPIDER2, which used an iterative deep learning method, further reduced the MAE of phi and psi angle prediction. However, most existing methods for torsion angle prediction are restricted to learning structural properties from local residue information in sliding windows. A recent method explored the long-range non-local interactions among residues by utilizing bidirectional neural networks, and showed that non-local contact information of residues can significantly improve torsion angle prediction.
In this study, we developed four deep learning methods, including Deep Neural Network, Deep Recurrent Neural Network, Deep Restricted Boltzmann Machine, and Deep Recurrent Restricted Boltzmann Machine, to predict torsion angles. To improve the prediction accuracy, various combinations of different input features, including two novel features (predicted contact number and error distribution of torsion angles of sequence fragments), are systematically examined and compared. We also compared our methods with two other torsion angle predictors: SPIDER2 and ANGLOR. The main contributions of this work include: (1) introducing two novel features that are useful for protein torsion angle prediction; and (2) developing and evaluating different deep learning architectures for torsion angle prediction.
In order to objectively compare our four methods and previous methods, we created a new dataset that has no overlap with the datasets used to train the existing methods in the literature. We obtained protein sequences released between June 2015 and March 2016 from the Protein Data Bank (PDB) and removed sequences whose length is outside the range [30, 500]. We then removed redundant sequences to make sure the pairwise sequence identity between any two sequences is less than 25%, resulting in a dataset consisting of 1652 protein sequences. We randomly selected 100 proteins from this dataset to estimate the distribution of errors between the torsion angles predicted from sequence fragments generated by FRAGSION and the true torsion angles for each of the 20 residue types (see section Input features for details). From the remaining 1552 protein sequences, we randomly chose 232 sequences as the test dataset, and the rest as the training dataset. The training dataset has 1320 sequences.
In order to further assess the performance of the different methods, we selected 11 free-modeling targets in the most recent CASP12 as an independent test dataset, whose native structures are available for torsion angle evaluation (see Additional file 1: Table S1 for the list of CASP12 targets and their lengths).
Figure 1 illustrates the general flowchart of our deep learning approach for torsion angle prediction. In our methods, seven different types of features represent each residue in a protein. The seven features include: physicochemical properties, protein position specific scoring matrix, solvent accessibility, protein secondary structure, protein disorder, contact number, and the error distribution of torsion angles predicted from sequence fragments. The first five features have been commonly used in various protein prediction problems such as secondary structure prediction and torsion angle prediction before. But the last two features are two novel features used for torsion angle prediction for the first time. The details of these features are described as follows.
The input features were normalized to the range of 0 to 1. The experimental values of the phi and psi torsion angles of each protein, which are the target output, were calculated by the DSSP program. There are four output nodes to predict the sine and cosine of the phi and psi angles, i.e. sin(φ), cos(φ), sin(ψ), cos(ψ), respectively. Sine and cosine were employed to remove the effect of angle periodicity during training. Predicted sine and cosine values of the two angles can be readily converted back to angles by using the equation: φ = tan⁻¹[sin(φ)/cos(φ)], and likewise for ψ.
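The sine/cosine encoding of the target angles can be illustrated with a short Python sketch. This is not the authors' code: the function names are hypothetical, and `math.atan2` is used in place of a plain inverse tangent on the assumption that the quadrant of the angle has to be recovered from the signs of both components.

```python
import math


def encode_angle(angle_deg):
    """Map an angle in degrees to its (sin, cos) pair (the network targets)."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)


def decode_angle(sin_val, cos_val):
    """Recover an angle in degrees from a (sin, cos) pair.

    atan2 uses the signs of both arguments, so angles in all four
    quadrants are recovered (assumption: the paper's tan^-1 is meant
    in this quadrant-aware sense).
    """
    return math.degrees(math.atan2(sin_val, cos_val))


# Angles on either side of the +/-180 boundary are far apart as raw
# numbers but close in (sin, cos) space, which is the point of the encoding.
s1, c1 = encode_angle(179.0)
s2, c2 = encode_angle(-179.0)
gap_in_encoding = math.hypot(s1 - s2, c1 - c2)
```

Because the decoder only depends on the ratio and signs of the two components, predicted values that are not exactly on the unit circle still decode to a well-defined angle.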
Since the torsion angle of a residue can be influenced by its nearby residues, a sliding window approach was used to extract the features for each residue. We combine all of the features in the window to form a feature vector for any residue i. For example, if the window size w is 15, letting k = (w-1)/2 = 7, we combine the features F(i-k), F(i-k+1), ..., F(i), ..., F(i+k-1), F(i+k) into one feature vector for residue i. In the past, the selection of a suitable window size was largely carried out in a heuristic way. ANGLOR chose a window size of 21 residues, and SPIDER2 chose a window size of 17 residues, while TANGLE used window sizes of 9 and 13 residues for phi and psi, respectively. In our experiments, we examined the performance of different window sizes ranging from 3 to 17, and then chose an optimal window size for each method based on 5-fold cross validation on the training data.
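The window construction described above can be sketched in a few lines of Python. The function name is hypothetical, and the handling of windows that run past the sequence termini (padding with a constant value) is an assumption, since the text does not state how the ends of the sequence are treated.

```python
def window_features(features, i, w, pad_value=0.0):
    """Concatenate the per-residue vectors F(i-k), ..., F(i+k) for window size w.

    features: list of equal-length feature vectors, one per residue.
    Positions outside the sequence are filled with pad_value
    (assumed padding scheme, not taken from the paper).
    """
    if w % 2 == 0:
        raise ValueError("window size must be odd")
    k = (w - 1) // 2
    dim = len(features[0])
    vec = []
    for pos in range(i - k, i + k + 1):
        if 0 <= pos < len(features):
            vec.extend(features[pos])
        else:
            vec.extend([pad_value] * dim)
    return vec
```

With a window size of 15 and d features per residue, each residue is therefore represented by a vector of 15 × d values.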
Deep learning, a new set of machine learning algorithms closely related to artificial neural networks [4, 9, 10, 29–33], has achieved state-of-the-art performance in many problems, and is becoming increasingly popular in bioinformatics [9–12]. Here, we designed four types of deep learning architectures for torsion angle prediction: a deep feed-forward neural network, a deep recurrent neural network, a deep belief network in which the parameters are pre-trained by restricted Boltzmann machines, and a deep recurrent RBM network in which the RBM is trained to initialize the parameters of the recurrent neural network. The four deep learning architectures are visualized in Fig. 3. Each network consists of an input layer, hidden layers and an output layer. Arrows represent connections between layers. In the input layer, the nodes (neurons) represent the features of each residue in a sequence window centered on a target residue for which torsion angles are predicted. All inputs are connected to every node in the adjacent hidden layer. The nodes in a hidden layer are fully connected to the nodes in the next layer, and finally the nodes in the last hidden layer are fully connected to the four output nodes, corresponding to sin(φ), cos(φ), sin(ψ), cos(ψ). The nodes in the hidden layers are activated by the sigmoid function. The four different deep learning architectures used in this study are described in detail below.
The deep neural network is a standard multi-layer feed-forward network, consisting of one visible “input” layer, multiple intermediate “hidden” layers, and one predicted “output” layer, as visualized in Fig. 3a. The nodes in each layer are fully connected to the nodes in its adjacent layers, and the network propagates the inputs from the first visible layer to the last output layer in a forward manner. For each node in the hidden layers, the input is calculated as the weighted sum of the node values in the previous layer, adjusted by the bias offset. The non-linear sigmoid activation function is then applied to calculate the output of a node from its input, which is defined as:

$$x_i^{n} = \sum_j W_{ij}^{n}\, y_j^{n-1} + b_i^{n}, \qquad y_i^{n} = \frac{1}{1 + e^{-x_i^{n}}} \qquad (1)$$

where $x_i^{n}$, $y_i^{n}$ and $b_i^{n}$ are the input, output and bias of the i-th node in the n-th layer, respectively, and (W, b) are the weights and biases. The network sums up all the non-linear outputs of nodes from one layer and propagates them to the next hidden layer until reaching the final output layer. A linear function is applied to the nodes in the output layer to generate the predicted real-value torsion angles.
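As an illustration of the forward pass described above (sigmoid hidden layers followed by a linear output layer), a minimal pure-Python sketch is given below. The function names and the nested-list weight layout are assumptions made for the example and are not taken from the authors' toolbox.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer; weights[j][i] connects input i to node j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        x = sum(w * v for w, v in zip(w_row, inputs)) + b  # weighted sum plus bias
        outputs.append(activation(x) if activation else x)
    return outputs


def forward(inputs, hidden_layers, output_layer):
    """Propagate inputs through sigmoid hidden layers and a linear output layer.

    hidden_layers: list of (weights, biases) pairs, one per hidden layer.
    output_layer: a single (weights, biases) pair with no non-linearity.
    """
    h = inputs
    for weights, biases in hidden_layers:
        h = dense(h, weights, biases, sigmoid)
    weights, biases = output_layer
    return dense(h, weights, biases)
```

In the setting of this paper the output layer would have four nodes, one for each of the sine and cosine targets.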
The forward pass in the neural network is generally followed by a backward pass that propagates the errors between the true and predicted torsion angles back to lower-level layers through the network, and updates the weights and biases according to the partial derivative of the error with respect to them (i.e. the gradient) to minimize the prediction error (or energy). In this study, the energy (loss) function is defined as the least square error between predicted angles and true angles:

$$E = \frac{1}{2}\,(\hat{y} - y)^2$$

The gradients of all weights and biases are inferred from back-propagation. That is, given a network with one hidden layer, the gradients used to update the parameters in the network are:

$$\frac{\partial E}{\partial W_{j}^{(2)}} = (\hat{y} - y)\,S_j, \qquad \frac{\partial E}{\partial b^{(2)}} = \hat{y} - y$$

$$\frac{\partial E}{\partial W_{ij}^{(1)}} = (\hat{y} - y)\,W_{j}^{(2)}\,S_j\,(1 - S_j)\,x_i, \qquad \frac{\partial E}{\partial b_{j}^{(1)}} = (\hat{y} - y)\,W_{j}^{(2)}\,S_j\,(1 - S_j)$$

where $W_{ij}^{(1)}$ and $b_{j}^{(1)}$ are the weight and bias in the first layer, connecting node i in the input layer (with value $x_i$) to node j in the hidden layer, and $W_{j}^{(2)}$ and $b^{(2)}$ are the weight and bias in the second layer. $S_j$ is the output of node j in the hidden layer, and $\hat{y}$ and $y$ are the predicted and true torsion angles. All the parameters are updated by the Sum-of-Functions-Optimizer (SFO) optimization method.
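To make the back-propagation step concrete, the sketch below computes the gradients for a network with one sigmoid hidden layer and a single linear output, assuming the loss is written as ½(ŷ − y)². It then applies one plain gradient-descent step. Note that plain gradient descent is a stand-in chosen for brevity; the paper itself uses the SFO optimizer, which is not reproduced here. All function names and the learning rate are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(x, W1, b1, W2, b2):
    """W1[i][j] connects input i to hidden node j; W2[j] connects hidden j to the output."""
    S = [sigmoid(sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    y_hat = sum(W2[j] * S[j] for j in range(len(S))) + b2
    return S, y_hat


def loss(x, y, W1, b1, W2, b2):
    _, y_hat = predict(x, W1, b1, W2, b2)
    return 0.5 * (y_hat - y) ** 2


def gradients(x, y, W1, b1, W2, b2):
    """Back-propagated gradients of the squared-error loss."""
    S, y_hat = predict(x, W1, b1, W2, b2)
    err = y_hat - y
    gW2 = [err * S[j] for j in range(len(S))]
    gb2 = err
    # Error signal at each hidden node: chain rule through the sigmoid.
    delta = [err * W2[j] * S[j] * (1.0 - S[j]) for j in range(len(S))]
    gW1 = [[delta[j] * x[i] for j in range(len(S))] for i in range(len(x))]
    gb1 = list(delta)
    return gW1, gb1, gW2, gb2


def gradient_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One plain gradient-descent update (stand-in for SFO); returns new parameters."""
    gW1, gb1, gW2, gb2 = gradients(x, y, W1, b1, W2, b2)
    new_W1 = [[w - lr * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_W2 = [w - lr * g for w, g in zip(W2, gW2)]
    new_b2 = b2 - lr * gb2
    return new_W1, new_b1, new_W2, new_b2
```

A quick way to check such gradients is to compare them with central finite differences of `loss`, which is a common sanity test before training.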
A traditional neural network starts by randomly initializing the weights of the network, which are then optimized by back-propagation over all data. Training deep neural networks in this way suffers from vanishing or exploding gradients during back-propagation, and from slow convergence to poor local optima with randomly initialized weights. Unsupervised pre-training methods have been developed to address this limitation, such as pre-training with denoising auto-encoders or with restricted Boltzmann machines (RBMs), which provide a good initialization of the network parameters that speeds up and enhances the training of very deep networks. In this study, we applied our in-house Deep Belief Network toolbox, a deep network with stacked restricted Boltzmann machines (RBMs), to the torsion angle prediction problem, as visualized in Fig. 3c.
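An RBM is a generative model that can model the probability distribution over binary data or real-valued continuous data [39–41]. An RBM is a two-layer network, consisting of one visible layer and one hidden layer, which represents the distribution of the input data over all possible hidden units, $P(v) = \sum_{h \in H} P(v, h)$. The objective of training an RBM is to adjust its weights in order to maximize the likelihood of the data, P(v). The training of an RBM is completely energy-guided, based on the joint probability of all visible and hidden nodes, which is described by the following equation:

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v', h'} e^{-E(v', h')}}, \qquad E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i w_{ij} h_j$$

where $v_i$ and $h_j$ denote the value of the i-th visible node and the j-th hidden node, $b_i$ and $c_j$ are the bias of the i-th visible node and the j-th hidden node, and $w_{ij}$ is the weight connecting the i-th visible node and the j-th hidden node.
To train the RBM, the parameters <W, b, c> are updated by the gradient of the negative log-likelihood of the data with respect to the parameters, which is given by:

$$-\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{model} - \langle v_i h_j \rangle_{data}, \qquad -\frac{\partial \log P(v)}{\partial b_i} = \langle v_i \rangle_{model} - \langle v_i \rangle_{data}, \qquad -\frac{\partial \log P(v)}{\partial c_j} = \langle h_j \rangle_{model} - \langle h_j \rangle_{data}$$

The gradient of each parameter was further approximated by the contrastive divergence (CD) algorithm, which has shown fast convergence within a few iterations of Gibbs sampling. In our experiment, the parameters are updated as:

$$w_{ij} \leftarrow w_{ij} + \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{k} \right), \qquad b_i \leftarrow b_i + \epsilon \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{k} \right), \qquad c_j \leftarrow c_j + \epsilon \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{k} \right)$$

where $\epsilon$ is the learning rate, $\langle \cdot \rangle_{data}$ denotes the expectation over the training data, $\langle \cdot \rangle_{k}$ denotes the expectation after k steps of Gibbs sampling, and one step of Gibbs sampling (k = 1) is chosen to train the RBM in our method. More details of training RBM are described in .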
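A single contrastive divergence update with one Gibbs step (CD-1) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes binary visible and hidden units, processes one training vector at a time, and uses hypothetical function names and a hypothetical learning rate `lr`.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for each hidden node j; W[i][j] links visible i to hidden j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) for each visible node i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, W, b, c, lr=0.1, rng=random):
    """Update W, b, c in place from one training vector using CD with k = 1."""
    ph0 = hidden_probs(v0, W, c)      # positive phase: driven by the data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)  # one Gibbs step: reconstruct v
    ph1 = hidden_probs(v1, W, c)      # negative phase: driven by the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return W, b, c
```

Using the hidden probabilities rather than sampled hidden states in the update terms is a common variance-reduction choice; whether the original toolbox does the same is not stated in the text.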
Multiple RBMs stacked in our deep restricted Boltzmann machine are trained in a stepwise fashion, in which the hidden data of a trained RBM is fed as visible input data to next RBM. This process is repeated multiple times to transform the original input data into multiple non-linear representations denoted by hidden layers. A standard neural network with a linear regression output node is added at the top of the last hidden layer of multiple RBMs to predict torsion angles. The entire deep restricted Boltzmann machine consisting of the input layer, hidden layers, and the output layer is fine-tuned by traditional back-propagation of the errors between predicted output and true output to adjust the parameters, as described in the section "Deep neural network" of training a standard neural network.
A recurrent neural network is a generalization of the traditional feed-forward neural network, developed to handle sequential data. Recurrent neural networks have achieved good performance on numerous bioinformatics problems, such as secondary structure prediction [20, 43]. Unlike a standard neural network that uses one fixed-size sliding window, a recurrent neural network can recognize patterns in sequences of variable lengths. The sliding window approach can only learn the short-range dependency of residues within the window, and the inputs from different windows are independent of each other. Our deep recurrent neural network calculates the output at a specific position (or time) not only from all the inputs at the position (or a fixed-size window centered at the position), but also from the outputs of the previous position (or time), as shown in Fig. 3b. For a simple network with one hidden layer, the calculation can be described by the following equation:

$$x_i^{t} = \sum_j W_{ij}\, f_j^{t} + \sum_k U_{ik}\, y_k^{t-1} + b_i, \qquad y_i^{t} = \frac{1}{1 + e^{-x_i^{t}}}$$

where $x_i^{t}$, $y_i^{t}$ and $b_i$ are the input, output and bias of the i-th node for residue t in the first hidden layer, respectively, $f_j^{t}$ is the j-th feature for residue t, $W_{ij}$ is the weight connecting feature j with node i, and $U_{ik}$ is the weight connecting the output k in the hidden layer for residue t-1 with node i in the hidden layer for residue t. The output in the output layer is calculated by a linear activation function as in Eq. 1. The weights of the recurrent network can be tuned by back-propagation through time (BPTT). The SFO algorithm was used to adjust the parameters (weights).
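The recurrence described above, in which the hidden output for residue t depends on the features of residue t and on the hidden output for residue t-1, can be sketched as follows. The function name, the weight layout and the all-zero initial hidden state are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_forward(features, W, U, b):
    """Hidden-layer outputs of a one-layer recurrent network, one list per residue.

    features: list over residues of feature vectors.
    W[i][j]: weight from feature j to hidden node i.
    U[i][k]: weight from hidden node k at residue t-1 to hidden node i at residue t.
    The hidden state before the first residue is assumed to be all zeros.
    """
    n_hidden = len(b)
    prev = [0.0] * n_hidden
    outputs = []
    for f in features:
        cur = []
        for i in range(n_hidden):
            x = b[i]
            x += sum(W[i][j] * f[j] for j in range(len(f)))
            x += sum(U[i][k] * prev[k] for k in range(n_hidden))
            cur.append(sigmoid(x))
        outputs.append(cur)
        prev = cur
    return outputs
```

Setting every entry of `U` to zero removes the dependence on the previous residue and reduces the layer to an ordinary feed-forward sigmoid layer applied independently at each position.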
Like traditional neural networks, recurrent neural networks may suffer from the vanishing gradient problem during training. For example, a state-of-the-art bidirectional recurrent neural network method for protein secondary structure prediction can only capture long-term dependencies of up to 15 amino acids from two directions. Inspired by the pre-training method applied in deep belief networks [39, 46] for mitigating the vanishing gradient problem, we integrate the restricted Boltzmann machine with the recurrent neural network to design a Deep Recurrent Restricted Boltzmann Machine for torsion angle prediction.
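In DReRBM, the energy function at residue t is adjusted to include the output of hidden nodes at residue t – 1. The overall energy and probability model is described by the following equation:

$$E(v^{t}, h^{t} \mid h^{t-1}) = -\sum_i \Big( b_i + \sum_k V_{ik}\, h_k^{t-1} \Big) v_i^{t} - \sum_j \Big( c_j + \sum_k U_{jk}\, h_k^{t-1} \Big) h_j^{t} - \sum_{i,j} v_i^{t}\, w_{ij}\, h_j^{t}$$

$$P(v^{t}, h^{t} \mid h^{t-1}) = \frac{e^{-E(v^{t}, h^{t} \mid h^{t-1})}}{\sum_{v', h'} e^{-E(v', h' \mid h^{t-1})}}$$

where $v_i$ and $h_j$ are the value of the i-th visible node and the j-th hidden node, $b_i$ and $c_j$ are the bias of the i-th visible node and the j-th hidden node, $w_{ij}$ is the weight connecting the i-th visible node and the j-th hidden node, $V_{ik}$ is the weight connecting the i-th visible node at time-stamp (t) with the k-th hidden node at time-stamp (t-1), and $U_{jk}$ is the weight connecting the j-th hidden node at time-stamp (t) with the k-th hidden node at time-stamp (t-1). In our architecture, each time-stamp represents a residue in a different position. In this energy function, we assume the dependency effect between two consecutive time-stamps is applied to the biases of both visible nodes and hidden nodes, so that pre-training by RBM might better capture the correlation between inputs. The gradient of the parameters can be calculated in Gibbs sampling as:

$$\Delta w_{ij} \propto \langle v_i^{t} h_j^{t} \rangle_{data} - \langle v_i^{t} h_j^{t} \rangle_{model}, \qquad \Delta V_{ik} \propto \langle v_i^{t} h_k^{t-1} \rangle_{data} - \langle v_i^{t} h_k^{t-1} \rangle_{model}, \qquad \Delta U_{jk} \propto \langle h_j^{t} h_k^{t-1} \rangle_{data} - \langle h_j^{t} h_k^{t-1} \rangle_{model}$$

with the bias terms $b_i$ and $c_j$ updated as in the standard RBM, and the model expectations approximated by Gibbs sampling.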
We used Mean Absolute Error (MAE) to evaluate the prediction of phi and psi angles. The MAE is the average absolute difference between predicted angles (P) and experimentally determined angles (E) for all residues. Here, both P and E are in the range of [−180, 180]. Because angles are periodic, a direct subtraction of the two values may result in an artificial absolute error greater than 180°. To rule out this artificial effect, we make a transformation of the predicted angles before comparing them as follows:

$$P = \begin{cases} P' - 360, & \text{if } P' - E > 180 \\ P' + 360, & \text{if } P' - E < -180 \\ P', & \text{otherwise} \end{cases}$$

where P′ is the original value of the predicted torsion angle. Paired t-tests are also applied to check the statistical significance of the differences between the performances of different methods.
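The periodicity-aware MAE can be sketched in Python as follows. The function names are hypothetical; the shift by 360° follows the idea stated above that a predicted angle should be compared with the experimental angle along the shorter way around the circle, so that no single residue contributes an error above 180°.

```python
def wrap_prediction(p_raw, e):
    """Shift the predicted angle by 360 degrees when that brings it closer to e."""
    if p_raw - e > 180.0:
        return p_raw - 360.0
    if p_raw - e < -180.0:
        return p_raw + 360.0
    return p_raw


def mean_absolute_error(predicted, experimental):
    """MAE in degrees over paired lists of predicted and experimental angles."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(wrap_prediction(p, e) - e)
                for p, e in zip(predicted, experimental))
    return total / len(predicted)
```

For example, a prediction of 179° against an experimental value of −179° counts as an error of 2° rather than 358°.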
We evaluated both normal deep learning models (DNN and DRBM) and deep recurrent learning models (DRNN and DReRBM). We also compared our methods with two other torsion angle prediction methods SPIDER2 and ANGLOR. In the following sections, firstly we assessed the impact of different feature combinations on the performance of DRBM. Then we identified the optimal window size for each of the four deep learning models and tested different memory sizes for two recurrent deep learning models (DRNN and DReRBM). Finally, we compared and analyzed the results of the six methods including our four in-house deep learning methods, SPIDER2, and ANGLOR.
We used 7 different features, including two new features (predicted contact number and the error distribution of torsion angles predicted from sequence fragments), with our deep learning methods. Table 1 compares the performance of different feature combinations with DRBM on the test dataset. The DRBM was trained on the training data with a window size of 17 and an architecture of three hidden layers of 500_200_50 (i.e. 500 nodes in the first hidden layer, 200 nodes in the second hidden layer, and 50 nodes in the third hidden layer). Among all the single features (Part 1 of Table 1), PSSM has a MAE of 23.28 and 35.12 for phi and psi angles, which gives the best “avg” value. Our two new features performed better than the three common features (physicochemical properties, solvent accessibility and disorder). We tested two kinds of secondary structure features (3-state secondary structure prediction and 8-state secondary structure prediction) and two kinds of contact number features (real-value contact number prediction and 15-class contact number prediction). The 8-state secondary structure feature achieved better performance than the 3-state secondary structure feature, and the 15-class contact number probability prediction was more effective than the predicted real-value contact number. To avoid redundancy in the features, we chose to use the 8-state secondary structure feature and the 15-class contact number probability feature with all our deep learning methods in this study.
Part 2 of Table 1 shows the results of combining PSSM with every other feature. Except for solvent accessibility, every other feature combination improved the prediction accuracy over using PSSM alone, suggesting that directly adding each of the five other features on top of PSSM is beneficial. For instance, combining PSSM with the error distribution of fragment-based angles has a MAE of 22.19 and 34.29 for phi and psi angles, and combining the predicted contact number with PSSM has a MAE of 22.41 and 33.14 for phi and psi angles, both of which are better than the MAE of 23.28 and 35.12 obtained using PSSM alone.
We continued to progressively add one additional feature to the best feature set of the previous round to find good combinations of 3 features, 4 features, 5 features, and all 7 features (see Parts 3–7 of Table 1). We found that this forward feature selection yields very good, or even the best, feature combinations for a specific number of features. Taking the results in Table 1 as a whole, we found that if the best feature combination of each round is used as the basis for adding more features, a better result is obtained in the next step most of the time. The best combination for each number of features tends to include either the contact number feature or the error distribution of fragment-based angles, indicating that the two new features can improve the prediction accuracy.
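The greedy forward feature selection described above can be sketched as follows. The `evaluate` callback stands in for training a model on a feature subset and returning its error (for instance the average MAE), and is entirely hypothetical; in the paper each evaluation involves training a DRBM, which is not reproduced here. The toy gains at the bottom are made-up numbers used only to exercise the function, not results from Table 1.

```python
def forward_selection(features, evaluate, start):
    """Greedy forward selection; lower scores from evaluate() are better.

    Starts from the single feature `start` and, in each round, adds the
    remaining feature whose inclusion gives the lowest score.
    Returns a list of (feature_subset, score) pairs, one per round.
    """
    selected = [start]
    history = [(tuple(selected), evaluate(selected))]
    remaining = [f for f in features if f != start]
    while remaining:
        best = min(remaining, key=lambda f: evaluate(selected + [f]))
        selected = selected + [best]
        remaining.remove(best)
        history.append((tuple(selected), evaluate(selected)))
    return history


# Toy stand-in for model training: each feature lowers the error by a
# fixed, invented amount (hypothetical values for illustration only).
toy_gain = {"pssm": 5.0, "contact": 2.0, "fragment": 1.5, "disorder": 0.1}


def toy_evaluate(subset):
    return 40.0 - sum(toy_gain[f] for f in subset)


toy_history = forward_selection(list(toy_gain), toy_evaluate, start="pssm")
```

Greedy selection evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing the best combination for a given subset size, which matches the observation above that it works "most of the time".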
We consider PSSM, solvent accessibility, secondary structure, protein disorder and 7 physicochemical properties as five standard features. In order to evaluate the performance improvement induced by adding the two novel features, we performed experiments with different feature sets with or without either one or both of the two novel features as follows: standard features (Feature set 1), standard features plus contact number (Feature set 2), standard features plus fragsion (Feature set 3), and standard features plus contact number and fragsion (Feature set 4). These experiments were conducted using the DRBM model. Table 2 shows that including either contact number or fragsion can slightly improve the prediction of phi and psi angles, while including both features can further improve the prediction accuracy, especially for the psi angle, whose prediction accuracy is improved by 10.1% if the two novel features are added.
A sliding window approach was used often to extract the local sequence profile for each residue in a sequence, which was used as input to our four deep learning methods. We tested window size ranging from 1 to 17 with our four deep learning architectures having 3 hidden layers consisting of 500, 200 and 50 nodes respectively.
Table 3 reports the accuracy of phi and psi angle predictions of the four different methods with different window sizes. The accuracy increases as the window size increases at the beginning, reaches its highest value at a certain size, and then starts to decrease as the size continues to increase. This is because increasing the window size at the beginning may incorporate more information than noise, leading to better performance, but after a specific threshold, increasing the window size may include more remote information that contains more noise than signal, leading to worse performance. According to Table 3, the best local window size for DRBM is 7, which has a MAE of 20.84 and 28.85 for phi and psi angles, respectively. Similarly, the best window size for DRNN is also 7. The best window size for DReRBM is 3. For DNN, the best window size is 11, which has a MAE of 21.04 and 29.06 for phi and psi angles. Compared with normal deep networks, deep recurrent networks can work well with smaller window sizes because they can use the output information from previous positions as input recursively. The fact that larger window sizes generally perform better than a window size of 1 suggests that local context information is important for torsion angle prediction.
Unlike traditional deep networks, deep recurrent networks assume that the output of the current position (or time) depends on that of the previous positions. Therefore, deep recurrent networks have a “memory”, which captures information about what has been calculated so far. In theory, recurrent networks can make use of the information from a long previous sequence, but in practice they are mostly limited to looking back a few steps due to vanishing gradients during back-propagation or a decreasing signal-to-noise ratio. In this work, we tested DRNN and DReRBM with five different memory lengths (i.e. 5, 10, 15, 20, 25) and the results are shown in Table 4. For DRNN, smaller memory lengths (5, 10, 15) yield better performance than larger memory lengths (20, 25), but DReRBM obtained comparable results using different memory lengths. This indicates that DReRBM can use a longer memory length than DRNN. In this study, since smaller memory lengths perform similarly or better, we chose to use a memory length of 5 to train both DRNN and DReRBM. Compared to the traditional deep feed-forward networks that make predictions based only on the information in a fixed-size local window, DRNN and DReRBM predict torsion angles using the information from the entire input sequence by propagating information through recurrent networks, which leads to the improved performance of the recurrent methods (DRNN and DReRBM) over the deep feed-forward network based methods (DNN and DRBM).
We performed 5-fold cross validation of our four methods on the training dataset and chose the appropriate feature combination, window size, and/or memory length for each of our deep learning methods. For the non-recurrent models DNN and DRBM, we assessed the effect of different numbers of hidden layers. As shown in Table 5, three hidden layers can achieve similar performance to five hidden layers, which is better than other numbers of hidden layers. Therefore, we finally used the simpler three-hidden-layer architecture (500_200_50), consisting of 500, 200 and 50 nodes in the first, second and third hidden layer, respectively. After these methods with selected parameters were trained on the training dataset, they were blindly tested on the test dataset consisting of 232 proteins.
Table 6 reports the MAE on the test data for DNN, DRBM, DRNN, DReRBM, SPIDER2 and ANGLOR. DReRBM has the lowest MAE of 20.49 and 29.06 for phi and psi angles, which is better than 20.88 and 31.64 of SPIDER2 and much better than 24.72 and 44.42 of ANGLOR. Overall, our four methods achieved the performance of phi angle prediction that was comparable to a state-of-the-art method SPIDER2 and made notable improvements on the prediction of psi angles. Our experiment also shows that the two deep recurrent networks (DRNN and DReRBM) performed better than the two standard deep networks (DNN and DRBM).
Table 7 shows the accuracy of our four deep learning models and SPIDER2 on 11 CASP12 free-modeling targets. For the phi angle, the MAEs of the four deep learning methods and SPIDER2 are comparable to each other. For the psi angle prediction, the recurrent methods produce lower MAEs than the other methods. Compared to SPIDER2, DReRBM achieves a 4.4% improvement in the MAE of the psi angle. The p-values of the paired t-tests between each pair of methods are shown in Table 8. For each method, the absolute error for each residue on the phi and psi angles was calculated, and paired t-tests were applied to the results of different methods. Table 8 shows that DReRBM is significantly more accurate than SPIDER2 on both phi and psi angles. In particular, the more significant improvement is achieved by DReRBM on the psi angle, which is consistent with the results in Table 6. In terms of the running time on the CASP12 dataset, SPIDER2 takes about 37 s and our methods take about 863 s to make a prediction on average. The relatively longer running time of our methods is due to the time needed by third-party tools to generate the input features. Once the features are generated, our methods can make predictions in seconds.
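The paired comparison of two methods on per-residue errors can be sketched with the standard library alone. The function below computes only the paired t statistic and its degrees of freedom; converting the statistic to a p-value requires the Student t distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used instead. The function name is hypothetical, and this is not the authors' evaluation script.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for per-residue errors of two methods.

    Returns (t, degrees_of_freedom). A negative t means method A has
    the lower mean error on the paired residues.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods must be evaluated on the same residues")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

Pairing by residue removes the variation between easy and hard residues from the comparison, which is why a paired test is appropriate when both methods are scored on the same targets.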
In this study, we developed four different deep learning methods for protein torsion angle prediction. We tested various feature combinations, window sizes, memory lengths, and numbers of hidden nodes to study their impact on the prediction accuracy. Our experiment shows that the two new features (predicted contact number and error distribution of fragment-based torsion angles) are useful for torsion angle prediction, and recurrent deep learning architectures perform better than feed-forward deep learning architectures. Finally, we demonstrated that deep learning methods achieved the performance better than or comparable to the state of the art methods for torsion angle prediction on both independent datasets and CASP12 targets.
The work was partially supported by an NIH grant (R01GM093123) to JC, and partially supported by National Science Foundation of China under the grant number 61170125 and the Research Innovation Program for College Graduates of Jiangsu Province (Grant No. KYZZ15_0332). The funding agencies do not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Project name: Deep learning methods for torsion angle prediction (DNTor1.0).
Project home page: http://sysbio.rnet.missouri.edu/multicom_toolbox/tools.html
Operating system(s): Linux.
Programming language: Python.
Additional file 1: Table S1. The 11 free-modeling targets in the most recent CASP12 (DOCX 12 kb)
HL, JH, JC designed the study. HL, JH implemented the method and conducted the experiment. HL, JH, BA, JC and QL analyzed the results. HL, JH, JC wrote the paper. All of the authors read, edited, and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Haiou Li, Email: 20144027005@stu.suda.edu.cn.
Jie Hou, Email: jh7x3@mail.missouri.edu.
Badri Adhikari, Email: adhikarib@umsl.edu.
Qiang Lyu, Email: qiang@suda.edu.cn.
Jianlin Cheng, Email: chengji@missouri.edu.