NMR chemical shifts in proteins depend strongly on local structure. The program TALOS establishes an empirical relation between 13C, 15N and 1H chemical shifts and the backbone torsion angles φ and ψ (G. Cornilescu et al. J. Biomol. NMR. 13, 289–302, 1999). Extension of the original 20-protein database to 200 proteins increased the fraction of residues for which backbone angles could be predicted from 65 to 74%, while reducing the error rate from 3 to 2.5%. Addition of a two-layer neural network filter to the database fragment selection process forms the basis for a new program, TALOS+, which further enhances the prediction rate to 88.5%, without increasing the error rate. Excluding the 2.5% of residues for which TALOS makes predictions that strongly differ from those observed in the crystalline state, the accuracy of the predicted φ and ψ angles equals ±13°. Large discrepancies between predictions and crystal structures are primarily limited to loop regions, and for the few cases where multiple X-ray structures are available such residues are often found in different states in the different structures. The TALOS+ output includes predictions for individual residues with missing chemical shifts, and the neural network component of the program also predicts secondary structure with good accuracy.
Chemical shifts are well recognized as important reporters on protein structure. Strong correlations between local structure and chemical shifts have been established by quantum chemistry methods, including both density functional theory (DFT) and Hartree Fock calculations (Xu and Case 2001; Czinki and Csaszar 2007; Moon and Case 2007; Vila et al. 2007; Villegas et al. 2007; Vila et al. 2008), and by alternate computational (Haigh and Mallion 1979; Williamson and Asakura 1993; Case 1995) or fully empirical methods (Wagner et al. 1983; Saito 1986; Spera and Bax 1991; Wishart et al. 1991; Williamson and Asakura 1993; Williamson et al. 1995; Asakura et al. 1997; Ando et al. 1998; Cornilescu et al. 1999; Castellani et al. 2003; Neal et al. 2003; Neal et al. 2006; Shen and Bax 2007). The need for streamlining the protein structure determination process has been well recognized (Billeter et al. 2008), and it is clear that recent chemical shift based approaches offer an attractive route to expedite this process, at least for smaller proteins (Cavalli et al. 2007; Shen et al. 2008; Wishart et al. 2008; Shen et al. 2009). At the same time, conventional structure determination efforts frequently take advantage of the empirical relation between chemical shifts and the backbone torsion angles φ and ψ, most commonly predicted by the program TALOS (Cornilescu et al. 1999), as a complement to conventional NOE distance restraints or to internuclear distances obtained by solid-state NMR.
In its original implementation, the TALOS (Torsion Angle Likeliness Obtained from Shift and Sequence Similarity) program was based on a small, 20-protein database for which complete or nearly complete heteronuclear resonance assignments and high resolution X-ray coordinates were available. In validation trials, the original program reported consistent predictions of φ and ψ for on average 65% of the residues. Subsequent expansion of the database to 78 proteins, implemented in post-2003 releases of the program, yielded consistent predictions of φ and ψ for on average 72% of the protein residues, with an error rate decreased to below 3% (unpublished results). Although at first glance these statistics appear quite encouraging, the vast majority of the predictions pertain to residues located in elements of well-defined secondary structure, where conventional NMR restraints often already define local structure quite well. The 28% of residues for which TALOS obtains ambiguous results are mostly located in regions of irregular structure, including loops and turns. We here report an extension of the original program, named TALOS+, which extends the fraction of consistent predictions to 88%, i.e., which cuts in half the fraction of residues unpredictable by TALOS, while at the same time slightly lowering the error rate to below 2.5%.
TALOS+ is largely based on the same concept as the original TALOS program, and now exploits a larger database of 200 proteins originally taken from the BMRB (Markley et al. 2008) for use in the chemical shift prediction program SPARTA (Shen and Bax 2007), but more importantly it includes a neural network component whose output is used as an additional term in the conventional TALOS database search. The neural network component of the program relies on a well established computational framework that optimizes the relation between a large number of input variables, such as amino acid types and chemical shifts, and any given output parameter. The latter, in our application, can be the secondary structure of any given amino acid or the area of the Ramachandran map where the residue resides. Importantly, after training on a database for which the input and output parameters are known, the neural network not only identifies the most likely answer when applied to datasets where the output is unknown, but it also reports a reliable estimate of the likelihood that any of the possible output values is applicable. Neural network algorithms are widely used in information processing, and have found numerous applications in NMR data analysis too. These include work on facilitating resonance assignment (Hare and Prestegard 1994; Huang et al. 1997; Pons and Delsuc 1999), identification of secondary structure in the presence and absence of NMR chemical shift data (Andreassen et al. 1990; Choy et al. 1997; Hung and Samudrala 2003), and approaches that permit prediction of chemical shifts based on known protein structure (Meiler 2003; Moon and Case 2007). Here, the inverse of this latter application is used to identify the approximate region of the Ramachandran map where a given residue resides, based on the chemical shifts and residue type of the residue in question, as well as those of its immediate neighbors in the protein sequence.
In order to expand the program’s ability to predict backbone torsion angles, TALOS+ now also considers the frequently encountered cases where residue assignments are lacking. Although the fraction of such residues for which consistent predictions can be made tends to be significantly lower, the reliability of such predictions remains high. For convenience, and in order to prevent assignment of backbone torsion angles to regions that are dynamically disordered, TALOS+ also reports an estimated backbone order parameter derived from the chemical shifts in the way recently described by Berjanskii and Wishart (2008).
The original TALOS protein structure database of 20 proteins (Cornilescu et al. 1999) in recent years has been upgraded to include 78 proteins, and this database is used in post-2003 release versions of the program. The current work utilizes the further expanded database of 200 proteins, originally developed for the SPARTA chemical shift prediction program (Shen and Bax 2007). This database, extracted from the BMRB, contains proteins with nearly complete backbone NMR chemical shifts (δ15N, δ13C′, δ13Cα, δ13Cβ, δ1Hα and δ1HN) as well as PDB coordinates from high-resolution X-ray structures. Details regarding the preparation of the database, including calibration of reference frequencies, have been described previously (Shen and Bax 2007). For the current application, if the database entry contains two or fewer assigned chemical shifts for any given residue, these chemical shift entries are removed. For residues with incomplete sets of chemical shifts (fewer than six for non-Gly residues, fewer than five for Gly), a standard TALOS database search (Cornilescu et al. 1999) was performed to find the average (secondary) chemical shifts for the atoms of the center residues of the 10 best-matched triplets. These predicted secondary chemical shifts were then assigned to the atom(s) of this residue with missing experimental chemical shifts. Therefore, after this adjustment the database contains residues with either complete 15N, 13C′, 13Cα, 13Cβ, 1Hα and 1HN chemical shifts, or no chemical shift values at all.
In order to study relations between NMR chemical shifts and backbone torsion angles, a three-state backbone “φ/ψ distribution” code is assigned to each residue: [1 0 0] (Alpha or “A”; −160° < φ < 0° and −70° < ψ < 60°), [0 0 1] (Left-handed helix, here referred to as positive-φ or “P”; 0° < φ < 160° and −60° < ψ < 95°), and [0 1 0] (Beta or “B”, comprising all others, including some residues with positive φ angles outside the P region). These regions are depicted in Figure 1A. For each residue in the database, a field was added to indicate the DSSP secondary structure (Kabsch and Sander 1983), determined from the X-ray coordinates, and further regrouped into three states: H (Helix; DSSP classification of H or G), E (Extended strand; E or B) and L (Loop; comprising DSSP classifications I, S, T and C).
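The three-state assignment and the DSSP regrouping described above can be written down compactly. The following is a minimal sketch, not code from the TALOS+ package: the function and dictionary names are hypothetical, angles are assumed to be in degrees, and mapping any unlisted DSSP code to L is an assumption made here for illustration.

```python
# One-hot codes for the three phi/psi states, as given in the text.
ONE_HOT = {"A": [1, 0, 0], "B": [0, 1, 0], "P": [0, 0, 1]}

# Regrouping of DSSP classes into helix (H), extended (E) and loop (L).
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",
    "E": "E", "B": "E",
    "I": "L", "S": "L", "T": "L", "C": "L",
}


def phi_psi_state(phi, psi):
    """Return 'A', 'P' or 'B' for backbone angles given in degrees."""
    if -160 < phi < 0 and -70 < psi < 60:
        return "A"  # alpha region
    if 0 < phi < 160 and -60 < psi < 95:
        return "P"  # positive-phi (left-handed helix) region
    return "B"      # everything else, including positive phi outside P


def regroup_dssp(code):
    """Map a DSSP class to H, E or L (unlisted codes fall back to L: an assumption)."""
    return DSSP_TO_THREE_STATE.get(code, "L")


if __name__ == "__main__":
    for phi, psi in [(-60, -45), (-120, 130), (60, 40), (60, 150)]:
        state = phi_psi_state(phi, psi)
        print(phi, psi, state, ONE_HOT[state])
```

Note that a residue with positive φ but ψ outside the −60° to 95° window is assigned to B, which matches the statement that B collects all residues not falling in A or P.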
TALOS+ uses a two-level feed-forward multilayer artificial neural network (ANN) to predict the location in φ/ψ space, or the secondary structure, based on a residue’s NMR chemical shifts and amino acid type, and those of its adjacent residues.
For the first level neural network (Figure 2), the input signals to the first layer consist of tri-peptide parameter sets derived from the above described database. Each tripeptide set has 78 nodes, representing six secondary chemical shift values and twenty amino acid type similarity scores for each residue. In the hidden layer of the network, where each node receives the weighted sum of the input layer nodes as a signal, 20 such nodes (or hidden neurons) are used. The output of a hidden layer node is obtained through a nodal transformation function; here a standard sigmoid function is used (see eq 1).
For the purpose of predicting the torsion angle distribution from NMR chemical shifts, the above described three-state φ/ψ torsion angle distribution of the center residue of each tri-peptide in the database is used as the target of the first level network: [1 0 0] for alpha (A), [0 1 0] for beta (B), and [0 0 1] for positive-φ (P). Each output value has one node with a linear activation function (f2(x) = x, eq 1). This procedure is schematically shown in Supplementary Information Figure S1. The empirical relationship between the 3-state φ/ψ torsion angle distribution and NMR chemical shift data received by the first level network is given by

$$P_{1\times 3} = f_2\!\left( f_1\!\left( X_{1\times 78}\, W^{(1)} + b^{(1)} \right) W^{(2)} + b^{(2)} \right) \qquad (1)$$

with f1(x) = 1/(1+e^(−x)), and f2(x) = x. X1×78 is the input data vector consisting of 78 elements; W(1) and b(1) are the weight matrix and bias, respectively, for the connection between the nodes in the input and the hidden layer; W(2) and b(2) are the weight matrix and bias, respectively, for the connection between the nodes in the hidden and output layer; P1×3 is the training target or the output vector.
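The forward pass of the first-level network follows directly from the definitions above: 78 inputs (three residues, each with six secondary shifts and twenty residue-type scores), 20 sigmoid hidden nodes, and 3 linear outputs. The sketch below is illustrative only. The function names are hypothetical, the weights are random placeholders rather than trained TALOS+ parameters, and the input vector is a dummy.

```python
import math
import random

N_SHIFTS, N_AA, N_RES = 6, 20, 3
N_INPUT = N_RES * (N_SHIFTS + N_AA)  # 3 * 26 = 78 input nodes
N_HIDDEN, N_OUTPUT = 20, 3


def sigmoid(x):
    """Numerically stable logistic function f1(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def forward(x, w1, b1, w2, b2):
    """Compute f2(f1(x W1 + b1) W2 + b2) with f2 the identity (linear output)."""
    hidden = [
        sigmoid(sum(xi * row[j] for xi, row in zip(x, w1)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(h * row[k] for h, row in zip(hidden, w2)) + b2[k]
        for k in range(len(b2))
    ]


def random_matrix(rows, cols, rng):
    """Placeholder weights; a real network would use trained values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    w1 = random_matrix(N_INPUT, N_HIDDEN, rng)
    w2 = random_matrix(N_HIDDEN, N_OUTPUT, rng)
    b1 = [0.0] * N_HIDDEN
    b2 = [0.0] * N_OUTPUT
    x = [0.0] * N_INPUT  # dummy tri-peptide input vector
    print(forward(x, w1, b1, w2, b2))  # three values, one per state A, B, P
```

The input is treated as a row vector, so W(1) has shape 78×20 and W(2) has shape 20×3; this orientation is inferred from the notation X1×78 and P1×3.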
The second level of the neural network, as implemented here, is used to smooth the prediction by accounting for commonly observed patterns in proteins, and follows its use in the well-known sequence-based secondary structure prediction programs PHD (Rost and Sander 1993) and PsiPred (Jones 1999). The two-level artificial neural network shown in Figure 2 uses the input information from three sequential residues for the first level, and the input from five sequential residues for the second level, and will be referred to as a 3–5 ANN model. A more detailed discussion of the slightly different ANN models used in this study is presented below.
For all ANN models used, the input layer for the second level uses the parameter set of the three-state φ/ψ torsion angle distribution predicted by the first level of the network for each available tri-peptide in the database, i.e., each set has 15 nodes when the input of five sequential residues is used. The hidden layer contains 6 nodes, and the three-state φ/ψ torsion angle distribution of the center residue of the corresponding pentapeptide in the database is used in the output layer and as the target of the neural network. The empirical formula of the neural network is similar to eq 1:

$$P_{1\times 3} = f_2\!\left( f_1\!\left( X_{1\times 15}\, W^{(1)} + b^{(1)} \right) W^{(2)} + b^{(2)} \right) \qquad (2)$$

where X1×15 is the input vector containing the 15 nodes; the definitions of weights, biases, and activation functions are the same as those in eq 1. Eqs 1 and 2 of this two-level network, with the optimized weights and biases obtained from the training dataset, are then used to predict the 3-state φ/ψ torsion angle distribution for residues in any protein of unknown structure. The eq 2 network output vector, P1×3, represents the probabilities for the query residue to be within each of the three states: alpha, beta and positive-φ.
The final “predicted state” of a given residue is assigned to the state with the largest probability. For later analysis of the prediction performance of the network, the confidence of a given prediction is defined as the difference between the probabilities of the two most favored predicted states.
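As a small illustration of these two definitions, the predicted state is the arg-max over the three output probabilities and the confidence is the gap between the largest and second-largest value. The function name below is hypothetical, and the state order (A, B, P) is taken from the one-hot codes given earlier.

```python
STATES = ("A", "B", "P")  # order of the network output vector


def predict_state(probs):
    """Return (state, confidence) for a three-element probability vector.

    The state is the one with the largest probability; the confidence is the
    difference between the two largest probabilities.
    """
    if len(probs) != len(STATES):
        raise ValueError("expected one probability per state")
    ranked = sorted(zip(probs, STATES), reverse=True)
    (p_best, s_best), (p_second, _) = ranked[0], ranked[1]
    return s_best, p_best - p_second


if __name__ == "__main__":
    state, confidence = predict_state([0.90, 0.08, 0.02])
    print(state, round(confidence, 2))
```

A residue with output [0.50, 0.45, 0.05] would still be assigned to A, but with a confidence of only about 0.05, which is why the confidence value matters when the prediction is used downstream.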
Several slight modifications of the above two-level neural network have also been used, to allow application to cases where missing chemical shift data do not permit use of the above 3–5 ANN model.
In order to study the relation between the three-state secondary structure (helix or H, extended strand, or E, and loop, L) and NMR chemical shifts, the same two-level neural network architectures are used, in which the three-state secondary structure classification of the center residue of the corresponding penta- or tri-peptide in the database is used in the output layer and as the target for both levels of the neural network.
The weights and bias terms were determined by training of the network, using the chemical shift and sequence information of the 200-protein database described above. To prevent over-training, a three-fold training and validation procedure was performed for each of the above mentioned neural network models by dividing the input training dataset into three input subsets, followed by separate training of the corresponding neural networks. For each of these three network optimizations, one input subset was excluded from the training dataset but then used to evaluate the performance of the neural network during the training. This subset, referred to as the validation dataset, was not used to calculate the weight changes in this network. Training of the network was terminated when the performance of the network on the validation dataset, represented by the mean squared error (MSE) between the predicted values and targets, began to degrade.
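The three-fold procedure with early stopping can be sketched as follows. This is an illustration under stated assumptions, not the training code used for TALOS+: the text does not say how the three subsets were formed or how "began to degrade" was detected, so the interleaved split and the `patience` parameter are hypothetical choices, as are the function names.

```python
def three_fold_splits(items):
    """Yield (training, validation) pairs; each third is held out once.

    The interleaved split is an assumption made for this sketch.
    """
    folds = [items[k::3] for k in range(3)]
    for k in range(3):
        validation = folds[k]
        training = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield training, validation


def stopping_epoch(validation_mse, patience=1):
    """Return the epoch with the best validation MSE before degradation.

    Training is considered finished once the validation MSE has failed to
    improve for `patience` consecutive epochs (hypothetical criterion).
    """
    best = 0
    for epoch in range(1, len(validation_mse)):
        if validation_mse[epoch] < validation_mse[best]:
            best = epoch
        elif epoch - best >= patience:
            break
    return best


if __name__ == "__main__":
    proteins = ["protein_%03d" % i for i in range(200)]  # placeholder names
    for training, validation in three_fold_splits(proteins):
        print(len(training), len(validation))
    print(stopping_epoch([0.50, 0.40, 0.35, 0.36, 0.30]))
```

Because the validation subset never contributes to the weight updates, its MSE serves as an independent indicator of when further training starts fitting noise in the training data.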
In addition to the above three-fold training and validation, a second validation procedure was performed for a set of 13 additional proteins, which have (1) (nearly) complete chemical shifts, (2) a good quality reference structure, (3) a wide range of folds and (4) no homologous protein (≥30% sequence identity) in the 200-protein database. The neural network prediction used for these 13 proteins was obtained by averaging over the outputs from the three networks separately trained above.
To inspect the network prediction performance of a given state for a protein or dataset, an accuracy score Q is defined (Rost and Sander 1993):

$$Q(i) = \frac{N_{\mathrm{corr}}(i)}{N_{\mathrm{obs}}(i)} \times 100\% \qquad (3)$$

which describes for state i the ratio of residues correctly predicted to be in state i (N_corr(i)) relative to those experimentally observed to be in state i (N_obs(i)). The overall network prediction performance for all three states in a protein or dataset can be measured by a Q3 score:

$$Q_3 = \frac{\sum_{i} N_{\mathrm{corr}}(i)}{\sum_{i} N_{\mathrm{obs}}(i)} \times 100\% \qquad (4)$$

Similarly, the prediction reliability is evaluated by a true-positive ratio:

$$TP(i) = \frac{N_{\mathrm{corr}}(i)}{N_{\mathrm{pred}}(i)} \times 100\% \qquad (5)$$

which describes for state i the ratio of residues correctly predicted to be in state i (N_corr(i)) relative to those predicted to be in state i (N_pred(i)). In our TALOS+ application of neural network prediction, the weight assigned to such a prediction depends on the confidence reported by the neural network. We therefore also define the values of eqs 3–5 for results reported at a confidence level >c%, and refer to these as Qc(i), Q3c(i), and TPc(i).
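The three scores can be computed from paired lists of observed and predicted states. The sketch below follows the verbal definitions given above (correct relative to observed for Q, correct relative to predicted for TP, and all correct relative to all residues for Q3). The function name and the toy data are hypothetical, and scores are returned as fractions rather than percentages.

```python
def prediction_scores(observed, predicted, states=("A", "B", "P")):
    """Return (q, q3, tp): per-state Q, overall Q3, and per-state TP ratio."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    q, tp = {}, {}
    total_correct = 0
    for s in states:
        n_obs = sum(o == s for o in observed)
        n_pred = sum(p == s for p in predicted)
        n_corr = sum(o == s and p == s for o, p in zip(observed, predicted))
        q[s] = n_corr / n_obs if n_obs else None     # undefined if state never observed
        tp[s] = n_corr / n_pred if n_pred else None  # undefined if state never predicted
        total_correct += n_corr
    q3 = total_correct / len(observed)
    return q, q3, tp


if __name__ == "__main__":
    observed = list("AAABBP")   # toy data, not from the paper
    predicted = list("AABBBP")
    q, q3, tp = prediction_scores(observed, predicted)
    print(q, q3, tp)
```

To obtain the confidence-filtered variants, the same function can be applied after discarding residues whose network confidence falls below the chosen threshold c.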
The predicted φ/ψ torsion angle classification, obtained by using the above neural network approach, was used as an additional input when carrying out the regular TALOS backbone torsion angle predictions (Cornilescu et al. 1999). This neural network supplemented software package is named TALOS+.
For a given query tri-peptide [i−1, i, i+1], the original TALOS program searches its database for the ten tri-peptides [j−1, j, j+1]k (k=1,…,10) best-matched in terms of backbone chemical shift and residue type. When at least 9 out of the 10 [φj/ψj]k cluster in the same region of the Ramachandran map, the original TALOS program made a φ/ψ prediction for residue i from the average values of the cluster. TALOS+ uses a modified similarity score, accounting for the output of the neural network φ/ψ distribution predictions:
where the terms accounting for the difference in residue type, ΔRestype, and the difference in secondary chemical shift (ΔδXi+n − ΔδXj+n) of nucleus X, including their weighting coefficients kn0 and knX, are identical to those of the standard TALOS similarity score (eq. 1, Cornilescu et al. 1999), X = 15N, 1HN, 1Hα, 13Cα, 13Cβ and 13C′. The new terms account for the difference of the φ/ψ states predicted for query residue i and observed for database residue j:
where Pi(sj) is the predicted probability for query residue i to be in state sj (the observed state of the corresponding residue of the database tri-peptide). The weighting factors for each of the terms are given by kns = 0.2, 1, 0.2 for n = −1, 0, 1. A confidence threshold value T = 0.8 is used in the default parameterization of the program; when the neural network prediction has a confidence below this value, a less steep weighting factor is used compared to residues whose φ/ψ state is predicted at high confidence, aimed at eliminating residues with φ/ψ states that the neural network deems highly unlikely.
With the addition of the neural network component in eq 7, which tends to narrow the distribution of φ/ψ angles in the top-10 selected triplets considerably, the default setting for accepting a TALOS+ prediction as consistent, or “good”, has been changed to cases where the center residues of all 10 selected fragments cluster in the same state, A, B, or P, which requires a confidence level greater than 0.6 by its ANN φ/ψ prediction; otherwise, such a prediction is designated as “ambiguous”. The TALOS+ database search and prediction procedure is shown schematically in Figure 3. Although not indicated in this figure, the neural network component of the program runs by default in the 3–5 ANN mode, but automatically switches to the 3–3 ANN model when chemical shifts are not available for five sequential residues. Moreover, when the first, center, or last residue in the triplet under consideration lacks chemical shifts, the neural network uses the 3–3 ANN(i−1), 3–3 ANN(i), or 3–3 ANN(i+1) model, respectively. These features are implemented in the TALOS+ program in a fully automated manner and therefore do not require user intervention. Predictions for these cases with partially missing chemical shifts extend the fraction of residues for which φ/ψ angles can be predicted at only a small cost in accuracy (vide infra). Additional recommendations regarding the use and interpretation of TALOS+ are available as Supporting Information. The TALOS+ database search procedure is performed by a program largely written in C++, which is several orders of magnitude faster than the original Tcl script driving the TALOS search, and thereby far offsets the slowdown caused by the larger database employed by TALOS+. On a PC with a single 2.4 GHz CPU, the TALOS+ database search procedure takes ca. 15 seconds for a 100-residue protein.
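The acceptance rule and the automatic choice of network model can be expressed as two small decision functions. This is a sketch of one reading of the text, not the TALOS+ implementation: the function names are hypothetical, the confidence requirement is treated here as a condition checked alongside the unanimous clustering, and the order in which missing residues are tested (i−1, then i, then i+1) is an assumption, since the text does not say what happens when more than one residue of the triplet lacks shifts.

```python
def classify_prediction(fragment_states, ann_confidence, min_confidence=0.6):
    """Label a prediction 'good' or 'ambiguous'.

    'good' needs all 10 selected fragments in the same state (A, B or P)
    and an ANN confidence above min_confidence (interpretation of the text).
    """
    unanimous = len(fragment_states) == 10 and len(set(fragment_states)) == 1
    if unanimous and ann_confidence > min_confidence:
        return "good"
    return "ambiguous"


def select_ann_model(has_shifts):
    """Pick the ANN model for residue i.

    has_shifts holds five booleans for residues i-2, i-1, i, i+1, i+2,
    each True when chemical shifts are available for that residue.
    """
    if len(has_shifts) != 5:
        raise ValueError("expected availability flags for five residues")
    if all(has_shifts):
        return "3-5 ANN"
    previous_ok, centre_ok, next_ok = has_shifts[1:4]
    if not previous_ok:
        return "3-3 ANN(i-1)"
    if not centre_ok:
        return "3-3 ANN(i)"
    if not next_ok:
        return "3-3 ANN(i+1)"
    return "3-3 ANN"  # triplet complete, but an outer residue lacks shifts


if __name__ == "__main__":
    print(classify_prediction(["A"] * 10, 0.85))
    print(classify_prediction(["A"] * 9 + ["B"], 0.85))
    print(select_ann_model([True, True, False, True, True]))
```

Compared with the original TALOS criterion of 9 out of 10 fragments agreeing, the stricter unanimity rule is affordable because the neural network term already narrows the set of fragments selected.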
The neural network analysis used by TALOS+ is trained to predict at the highest possible accuracy the φ/ψ angle state (Alpha, Beta, or Positive-φ) on the basis of the backbone NMR chemical shifts and residue type of the residue itself and its neighbors in the sequence. The 200-protein database used for training the neural network comprised a total of 23,257 residues, and the subset of 19,894 residues with three or more chemical shifts assigned was used for training of the neural network models. The φ/ψ angle distribution of the full set of database residues is shown in Figure 1A; the number of residues in state Alpha, Beta, and Positive-φ is 11,701, 10,596 and 960, respectively.
When ignoring the confidence level of the neural network prediction, correct assignment (TP(i); eq 5) of the Alpha, Beta, and Positive-φ regions is found for 96.6 and 96.3% of the database residues for the 3–5 ANN and 3–3 ANN models, respectively (Table S1). These numbers drop to about 94% when one of the residues in the triplet is lacking chemical shifts (Table S1). Importantly, when limiting the evaluation to residues whose φ/ψ region can be predicted at a confidence ≥80%, the success rate TP80(i) is much higher, 98.7%, almost independently of the neural network type used (Table S1). However, as expected, the fraction of residues for which a confidence level ≥80% is obtained drops when fewer data are available, from 89% when the 3–5 ANN model can be used, to 81% when the chemical shifts for the residue in question are missing (but shifts for the adjacent residues are available; model 3–3 ANN(i)). When the confidence level threshold is raised to 0.9, the error rate in the neural network output drops to well below 1% (Figure 1B–D). An average TP80(i) score of 99.0% for 13 test proteins which are not part of the 200-protein database used during neural network training (Table S3) is very similar to what is seen for the database itself and confirms that no over-training of the neural network has taken place.
The TALOS+ user interface is very similar to that of the original TALOS program (Figure 4). New features include a marking on the Ramachandran map of the ANN-predicted probability to find any given residue in the Alpha, Beta, or Positive-φ region, and two graphs displaying the RCI-derived (Berjanskii and Wishart 2005; Berjanskii and Wishart 2008) order parameter, S2, and the ANN-predicted secondary structure. For the latter, the length of the bars corresponds to the probability of a residue being helix or β-strand. In the sequence display, unambiguous predictions are marked in green, ambiguous results in yellow, and residues predicted to be dynamically disordered are colored in blue. As with the original TALOS program, separate output files containing the details of each prediction are also generated.
Backbone torsion angles were predicted by both the original TALOS and the new TALOS+ programs for all of the 200 database proteins in a “leave-one-out” cross-validation manner, i.e., for predicting the backbone angles of any given protein, that protein was removed from the database prior to the search. Results are summarized in Table 1. The original TALOS method, on average, makes “unambiguous” predictions for about 74% of the residues when applied to our larger database, with 2.48% of the predicted φ/ψ torsion angles having large errors relative to those observed in the reference X-ray structures. As seen in Table 1, the root-mean-square differences (rmsd) between the predicted and crystallographically observed backbone angles are slightly larger for the angles reported by TALOS+ than by TALOS. However, this small increase results primarily from the fact that TALOS+ includes far more predictions outside regions of regular secondary structure. When restricting the rmsd evaluation to the residues predicted by TALOS, the rmsd obtained by TALOS+ is actually slightly lower (Table 1). With TALOS+, the number of “unambiguous” predictions jumps to 88.5%, while the error rate decreases slightly to 2.46%. More details regarding how well TALOS and TALOS+ compare for different residue types, and for the different proteins in the database, are provided in Supplementary Information Figures S2 and S3.
The performance of TALOS+ predictions was further validated for 13 proteins with various folds and absent from the TALOS database (Table 2). These include the small proteins GB3 (Ulmer et al. 2003), DinI (Ramirez et al. 2000), BAF (Cai et al. 1998), and TolR (Parsons et al. 2008), determined at high resolution by NMR with the aid of RDCs, and nine proteins whose NMR assignments and X-ray structures have recently become available (Table 2). The statistics for the TALOS+ predictions on these new proteins are very similar to those observed for the 200 protein database, with 90% of the residues predicted as “unambiguous”, and an error rate below 2.0%.
It is perhaps interesting to note that our reported error rate of the TALOS+ predictions in all likelihood significantly overestimates the true error rate: Many of the “erroneous” predictions occur outside of regions of secondary structure, where the X-ray and solution structures may actually differ from one another. An interesting example in this respect is the protein FluA, for which multiple X-ray structures are available. Comparing the TALOS+ predictions to these structures shows three to seven “errors”, depending on which reference structure is used (Figure S4; Table S4). However, not a single one of these “erroneous” predictions differs consistently with all three X-ray structures, suggesting that the TALOS+ result simply reflects the difference between the solution state of the protein and the various states of these residues observed by X-ray crystallography.
NMR chemical shifts have been widely used to identify the secondary structure elements in proteins (Wishart et al. 1992; Huang et al. 1997; Wang and Jardetzky 2002; Hung and Samudrala 2003). Here, we also evaluate the prediction performance of our neural network for secondary structure prediction, using the same input data as used above for grouping the backbone torsion angles in three regions, and we include the predicted secondary structure as an additional feature of the TALOS+ program.
Using a 3–3 ANN model, evaluation of the TALOS+ secondary structure prediction over the 200-protein database by the “leave-one-out” cross-validation method yields Q ratios (eq 4) of 94.3%, 88.3% and 82.4% for helix, extended, and loop residues, respectively. The overall Q3 of 88.9% compares favorably with the 82–89% Q3 range reported by the other NMR-based secondary structure prediction programs, perhaps because TALOS+ uses a larger set of backbone chemical shifts per residue than most of the other programs.
Evaluation of the secondary structure prediction efficiency on the set of 13 proteins whose data are not part of the database yields very similar results, again proving that over-training of our neural network was successfully avoided. Details of the secondary structure prediction efficiency of TALOS+ and the popular CSI (Wishart et al. 1992), PSSI (Wang and Jardetzky 2002), and PsiCSI (Hung and Samudrala 2003) programs are presented in Table S3.
TALOS+ offers a significant extension of our ability to predict protein backbone torsion angles from chemical shifts. Compared to the original TALOS program, the fraction of residues whose backbone angles cannot be predicted unambiguously is reduced by more than 50%. The additional residues whose torsion angles now can be predicted reliably are located outside of regions of secondary structure, where typically such restraints are most needed. Considering that backbone chemical shifts are obtained early on during the NMR study of a protein, these results can guide the further data analysis and may prove particularly important for the study of larger proteins, where typically the number of NOE restraints per residue tends to drop significantly. In this respect it is interesting to note that the calculated structure of the protein malate synthase G, the largest single chain protein whose structure has been determined by NMR, falls closer to the X-ray structure (2.6 vs 3.3 Å Cα rmsd) when the new unambiguous TALOS+ torsion angle restraints are included instead of the TALOS restraints used originally (Tugarinov et al. 2005; Grishaev et al. 2008).
The improvement in performance of TALOS+ over TALOS is primarily the result of its incorporation of the neural network output into the selection of database fragments that most closely match the residues in the query protein. It is conceivable that with further training and refinement, in combination with an even larger database, small additional improvements may be attainable. On the other hand, a significant fraction of the residues whose backbone torsion angles cannot be predicted unambiguously by TALOS+ exhibit high amplitude backbone motions, as evidenced by their RCI-derived order parameters, and often are found at the termini of the protein or in longer loop regions. For such regions, it is unlikely that further improvements to TALOS+ will provide significant enhancements.
We thank Alex Grishaev for carrying out the MSG calculation with the new TALOS+ backbone angle restraints. This work was funded by the Intramural Research Program of the NIDDK, NIH.
The TALOS+ software package can be downloaded from http://spin.niddk.nih.gov/bax/.
Supplementary Material Available
Four tables with details regarding the performance of the neural network and of TALOS+; four figures detailing the neural network architecture and the performance of TALOS+; a user guide for the TALOS+ program.