In its original implementation, the TALOS (Torsion Angle Likeliness Obtained from Shift and Sequence Similarity) program was based on a small, 20-protein database for which complete or nearly complete heteronuclear resonance assignments and high-resolution X-ray coordinates were available. In validation trials, the original program reported consistent predictions of ϕ and ψ for on average 65% of the residues. Subsequent expansion of the database to 78 proteins, implemented in post-2003 releases of the program, yielded consistent predictions of ϕ and ψ for on average 72% of the protein residues, with an error rate decreased to below 3% (unpublished results). Although at first glance these statistics appear quite encouraging, the vast majority of the predictions pertain to residues located in elements of well-defined secondary structure, where conventional NMR restraints often already define local structure quite well. The 28% of residues for which TALOS obtains ambiguous results are mostly located in regions of irregular structure, including loops and turns. We here report an extension of the original program, named TALOS+, which raises the fraction of consistent predictions to 88%, i.e., which cuts in half the fraction of residues unpredictable by TALOS, while at the same time slightly lowering the error rate to below 2.5%.
TALOS+ is largely based on the same concept as the original TALOS program, and now exploits a larger database of 200 proteins originally taken from the BMRB (Markley et al. 2008) for use in the chemical shift prediction program SPARTA (Shen and Bax 2007), but more importantly it includes a neural network component whose output is used as an additional term in the conventional TALOS database search. The neural network component of the program relies on a well-established computational framework that optimizes the relation between a large number of input variables, such as amino acid types and chemical shifts, and any given output parameter. The latter, in our application, can be the secondary structure of any given amino acid or the area of the Ramachandran map where the residue resides. Importantly, after training on a database for which the input and output parameters are known, the neural network not only identifies the most likely answer when applied to datasets where the output is unknown, but it also reports a reliable estimate of the likelihood that any of the possible output values is applicable. Neural network algorithms are widely used in information processing, and have found numerous applications in NMR data analysis too. These include work on facilitating resonance assignment (Hare and Prestegard 1994; Huang et al. 1997; Pons and Delsuc 1999), identification of secondary structure in the presence and absence of NMR chemical shift data (Andreassen et al. 1990; Choy et al. 1997; Hung and Samudrala 2003), and approaches that permit prediction of chemical shifts based on known protein structure (Meiler 2003; Moon and Case 2007). Here, the inverse of this latter application is used to identify the approximate region of the Ramachandran map where a given residue resides, based on the chemical shifts and residue type of the residue in question, as well as those of its immediate neighbors in the protein sequence.
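To make the idea of a network that reports a likelihood for each possible output concrete, the following minimal Python sketch encodes a residue and its two sequence neighbors as input features and passes them through a small feed-forward network with a softmax output. It is an illustration only and not the TALOS+ implementation: the three region labels, the set of six nuclei, the hidden-layer size, and all function names are assumptions introduced here, and the weights are random rather than trained, so the printed values carry no structural meaning.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed set of backbone nuclei whose secondary shifts serve as inputs.
SHIFT_NUCLEI = ("N", "HN", "HA", "CA", "CB", "C")
# Hypothetical coarse Ramachandran classes, chosen for illustration only.
REGIONS = ("alpha", "beta", "other")


def encode_residue(aa, shifts):
    """One-hot residue type followed by secondary shifts (0.0 if missing)."""
    one_hot = [1.0 if aa == a else 0.0 for a in AMINO_ACIDS]
    return one_hot + [shifts.get(n, 0.0) for n in SHIFT_NUCLEI]


def encode_window(window):
    """Concatenate the features of a residue and its immediate neighbors."""
    features = []
    for aa, shifts in window:
        features.extend(encode_residue(aa, shifts))
    return features


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(x, params):
    """One tanh hidden layer followed by a softmax output layer."""
    hidden = [math.tanh(h) for h in dense(x, params["W1"], params["b1"])]
    return softmax(dense(hidden, params["W2"], params["b2"]))


def random_params(n_in, n_hidden, n_out, seed=0):
    """Untrained placeholder weights; a real network would be fitted to a database."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}


# Made-up secondary shifts (ppm) for a three-residue window centered on Gly.
window = [("A", {"CA": 1.2, "CB": -0.4}),
          ("G", {"CA": 0.3}),
          ("L", {"CA": 2.0, "CB": -0.8, "HA": -0.2})]
x = encode_window(window)
params = random_params(len(x), 8, len(REGIONS))
probs = forward(x, params)
for region, p in zip(REGIONS, probs):
    print(f"{region}: {p:.3f}")
```

Because the output layer is a softmax, the values are non-negative and sum to one, which is what allows them to be read as per-class likelihoods and, in principle, combined with another scoring term such as a database search score.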
In order to expand the program’s ability to predict backbone torsion angles, TALOS+ now also considers the frequently encountered cases where residue assignments are lacking. Although the fraction of such residues for which consistent predictions can be made tends to be significantly lower, the reliability of such predictions remains high. For convenience, and in order to prevent assignment of backbone torsion angles to regions that are dynamically disordered, TALOS+ also reports an estimated backbone order parameter derived from the chemical shifts in a way recently described by (Berjanskii and Wishart 2008