Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.12(Suppl 1); 2011**|**PMC3044290

Formats

Article sections

- Abstract
- Background
- Methods
- Results
- Discussion
- Conclusions
- Authors' contributions
- Competing interests
- References

Authors

Related links

BMC Bioinformatics. 2011; 12(Suppl 1): S34.

Published online 2011 February 15. doi: 10.1186/1471-2105-12-S1-S34

PMCID: PMC3044290

Igor Chikalov: igor.chikalov/at/kaust.edu.sa; Peggy Yao: peggyyao/at/stanford.edu; Mikhail Moshkov: mikhail.moshkov/at/kaust.edu.sa; Jean-Claude Latombe: latombe/at/cs.stanford.edu

Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

Yi-Ping, Phoebe Chen and Kwang-Hyun Cho

The Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

11–14 January 2011

Inchon, Korea

Copyright ©2011 Chikalov et al; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

Hydrogen bonds (H-bonds) play a key role in both the formation and stabilization of protein structures. They form and break while a protein deforms, for instance during the transition from a non-functional to a functional state. The intrinsic strength of an individual H-bond has been studied from an energetic viewpoint, but energy alone may not be a very good predictor.

This paper describes inductive learning methods to train protein-independent probabilistic models of H-bond stability from molecular dynamics (MD) simulation trajectories of various proteins. The training data contains 32 input attributes (*predictors*) that describe an H-bond and its local environment in a conformation *c* and the output attribute is the probability that the H-bond will be present in an arbitrary conformation of this protein achievable from *c* within a time duration Δ. We model dependence of the output variable on the predictors by a regression tree.

Several models are built using 6 MD simulation trajectories containing over 4000 distinct H-bonds (millions of occurrences). Experimental results demonstrate that such models can predict H-bond stability quite well. They perform roughly 20% better than models based on H-bond energy alone. In addition, they can accurately identify a large fraction of the least stable H-bonds in a conformation. In most tests, about 80% of the 10% H-bonds predicted as the least stable are actually among the 10% truly least stable. The important attributes identified during the tree construction are consistent with previous findings.

We use inductive learning methods to build protein-independent probabilistic models to study H-bond stability, and demonstrate that the models perform better than H-bond energy alone.

A protein is a long sequence of amino-acids, called residues. Under normal physiological conditions, various forces (electrostatic, van der Waals, ...) lead the protein to fold into a compact structure made of secondary structure elements, α-helices and β-strands, connected by bends (called loops). An H-bond corresponds to the attractive electrostatic interaction between a covalent pair D—H of atoms, in which the hydrogen atom H is bonded to a more electronegative donor atom D, and an electronegative acceptor atom A. Due to their strong directional character, short distance ranges, and large number in folded proteins, H-bonds play a key role in both the formation and stabilization of protein structures [1-3]. While H-bonds involving atoms from close residues along the main-chain sequence stabilizes secondary structure elements, H-bonds between atoms in distant residues stabilize the overall 3D arrangement of secondary structure elements and loops.

H-bonds form and break while the conformation of a protein deforms. For instance, the transition of a folded protein from a non-functional state into a functional (e.g., binding) state may require some H-bonds to break and others to form [4]. So, to better understand the possible deformation of a folded protein, it is desirable to create a reliable model of H-bond stability. Such a model makes it possible to identify rigid groups of atoms in a given protein conformation and determine the remaining degrees of freedom of the structure [7]. Since most H-bonds in a protein conformation are quite stable, it is crucial that the model precisely identifies the least stable bonds. The intrinsic strength of an individual H-bond has been studied before from an energetic viewpoint [5,6]. However, potential energy alone may not be a very good predictor of H-bond stability. Other local interactions may reinforce or weaken an H-bond.

Let *c* be the conformation of a protein *P* at some time considered (with no loss of generality) to be 0 and *H* be an H-bond present in *c*. Let *M* (*c*) be the set of all physically possible trajectories of *P* passing through *c* and *π* be the probability distribution over this set. We define the *stability* of *H* in *c* over the time interval Δ by:

(1)

where *I* (*q*, *H*, *t*) is a Boolean function that takes value 1 if *H* is present in the conformation *q*(*t*) at time *t* along trajectory *q*, and 0 otherwise. The value can be interpreted as the probability that *H* will be present in the conformation of *P* at any specified time *t* (0, Δ), given that *P* is at conformation *c* at time 0. Our goal is to design a method for generating good approximations *σ* of . We also want these approximations to be protein-independent.

We use machine learning methods to train a stability model *σ* from a given set *Q* of MD simulation trajectories. Each trajectory *q* *Q* is a sequence of conformations of a protein. These conformations are reached at times *t _{i}* =

For each *h*, we compute a fixed list of predictors, some *numerical*, others *categorical*. Some are time-invariant, like the number of residues along the main-chain between the donor and acceptor atoms. Others are time-dependent. Among them, some describe the geometry of *h*, e.g., the distance between the hydrogen and the donor. Others describe the local environment of *h*, e.g., the number of other H-bonds within a certain distance from the mid-point of H.

We train *σ* as a function of these predictors. The predictor list defines a predictor space ∑ and every H-bond occurrence maps to a point in ∑. Given the input set *Q* of trajectories, we build a data table in which each row corresponds to an occurrence *h* of an H-bond present in a conformation *q*(*t _{i}*) contained in

We build σ as a binary regression tree [9]. This machine learning approach has been one of the most successful in practice. Regression trees are often simple to interpret. The method can work with both categorical and numerical predictors in a unified way, as shown in Section III. Each non-leaf node in a regression tree is a Boolean *split*. So, each node *N* of the tree determines a region of ∑ in which all the splits associated with the arcs connecting the root of the tree to *N* are satisfied. We say that an H-bond occurrence falls into *N* if it is contained in this region. The predicted stability value stored at a leaf node *L* is the average of the measured stability values by all the H-bond occurrences in the training data table that fall into *L*. We expect this average, which is taken over many pieces of trajectories, to approximate well the average defined in Equation (1).

Once a regression tree has been generated, it is used as follows. Given an H-bond *H* in an arbitrary conformation *c* of an arbitrary protein, the leaf node *L* of the tree into which *H* falls is identified by calculating the values of the necessary predictors for *H* in *c*. The predicted stability value stored at *L* is returned.

We construct a model *σ* as a binary regression tree using the CART method [9]. The tree is generated recursively in a top-down fashion. When a new node *N* is created, it is inserted as a leaf of the tree if a predefined depth has been reached or if the number of *h* falling into *N* is smaller than a predefined threshold. Otherwise, *N* is added as an intermediate node, its split is computed, and its left and right children are created. A split *s* is defined by a pair (*p*, *r*), where *p* is the *split predictor* and *r* is the *split value*. If *p* is a numerical predictor, then *r* is a threshold on *p*, and *s* *p* <*r*. If *p* is a categorical predictor, then *r* is a subset of categories, and *s* *p* *r*. We define the *score w*(*p*, *r*) of split *s* = (*p*, *r*) at a node *N* as the reduction of variance in measured stability that results from *s*. The algorithm chooses the split—both the predictor and the split value—that has the largest score. Only a relatively small subset of predictors selected by the training algorithm is eventually used in a regression tree.

To prevent model overfitting, we limit tree depth to 5 in most of our experiments and limit the minimal number of training samples in an intermediate node to be 10. We further prune the obtained tree using the following adaptive algorithm. We initially set aside a fraction of the training data table called *validation subset*. Once a tree has been constructed pruning is an iterative process. At each step, one intermediate node *N* whose split has minimal score becomes a leaf node by removing the sub-tree rooted at *N*. This process creates a sequence of trees with decreasing numbers of nodes. We compute the mean square error of the predictions made by each tree on the validation subset. The tree with the smallest error is selected.

We used 6 MD simulation trajectories picked from different sources and called hereafter *1c9oA*, *1e85A*, *1g9oA_1*, and *1g9oA_2* from [10], *complex* from [11], and *1eia* (generated by us). In all of them the time interval *δ* between two successive ticks is 1ps. Table Table11 indicates the protein simulated in each trajectory, its number of residues, the force field used by the simulator, and the duration of the trajectory. Each trajectory starts from a folded conformation resolved by X-ray crystallography.

From each trajectory we derived a separate data table in which the rows represent H-bond occurrences. Last two columns in Table Table11 list the number of distinct H-bonds detected in each trajectory and the total number of H-bond occurrences extracted. Note that *complex* was generated for a complex of two molecules. All H-bonds occurring in this complex are taken into account in the corresponding data table.

The values of the time-varying predictors are subject to thermal noise. Since a model *σ* will in general be used to predict H-bond stability in a protein conformation sampled using a kinematic model ignoring thermal noise (*e.g.*, by sampling the dihedral angles *ϕ*, *ψ*, and *χ*) [7], we chose to average the values of these predictors over *l*' ticks to remove thermal noise. More precisely, the value of a predictor stored in the row of the data table corresponding to an H-bond occurrence in *q*(*t _{i}*) is the average value of this predictor in , where . Our analysis shows that

The performance of a regression model can be measured by the root mean square error (RMSE) of the predictions on a test dataset. For a data table *T* = {(*x*_{1}, *y*_{1}), (*x*_{2}, *y*_{2})},…, (*x _{n}*,

Our goal is to train models to predict the stability of H-bonds in any protein. So, we trained models on data tables obtained by mixing subsets of 5 data tables and we tested these models on the remaining data table. For each combination of 5 data tables, we trained 4 groups of models varying in the tree’s maximal depth (5 or 15) and in the fraction of H-bond occurrences randomly taken from each data table (10% or 50%). For each group we trained 10 models. Hence, in total, 240 models were generated. Table Table22 shows the mean RBED value for each combination of data tables and each group of models. In columns 3 through 8 we indicate the data table used for testing the models trained on a combination of the 5 other data tables. Figure Figure11 shows a partial tree trained with combinations of all tables, except *1c9oA*.

Top 2 layers of a regression tree trained with combination of all tables, except *1c9oA*. The actual tree contains 55 nodes. Each path from the root to a node defines a conjunction of criteria for H-bonds with a certain mean stability. Here, Dist_H_A (the **...**

RBED values show that regression tree model significantly reduces base error and keeps predictive power when applied to a protein not present in the training data. Moreover, the variance of RBED values is very small, meaning that the training process yields models that are stable in performance. Furthermore, the RBED values are lower for models tested on *complex*. Recall that the trajectory *complex* was generated for a complex made of a protein and a ligand, while every other trajectory was generated for a single protein. So, it is likely that *complex* contains H-bonds in situations that did not occur in any of the other trajectories. Both deeper trees and larger data fractions tend to improve model accuracy, but the very small gain is not worth the additional model or computation complexity.

We've checked whether regression models can predict the stability of H-bonds more accurately than potential energy alone. Table Table33 presents the mean RBED value for a model obtained in the first row of Table Table22 relative to the base model that is a regression tree built from the same training data using FIRST_energy as the only predictor. FIRST_energy is a modified Mayo potential [5] implemented in FIRST (a protein rigidity analysis software) [7]. Comparison on all 6 data tables show that the more complex models are significantly more accurate than the models based on FIRST_energy alone.

Most H-bond occurrences tend to be stable. So, accurately identifying the weakest ones is important if one wishes to predict the possible deformation of a protein [7]. To evaluate how well our models identify the least stable H-bonds occurrences, we first identify the subset *S* of the 10% H-bond occurrences with the smallest measured stability in each test table *T*. Using a regression tree σ obtained in Section II, we sort the H-bond occurrences in *T* in ascending order of predicted stability and we compute the fraction *w* [0,1] of *S* that is contained in the first 100×u% occurrences in this sorted list, for successive values of *u* [0,1]. We call the function *w*(*u*) the identification curve of the least stable H-bonds for σ.

Figure Figure22 plots the identification curve for *1c9oA*. It consists of three curves: the red curve is the (fictitious) ideal identification curve, the blue curve is obtained with one (randomly picked) regression tree computed in Section II, and the green curve is obtained by sorting H-bond occurrences in decreasing values of FIRST_energy. Plots on other proteins present similar curve shapes. For models tested on data tables except *complex*, about 80% of the 10% H-bond occurrences predicted as the least stable are actually among the 10% truly least stable. The results for *complex* are less satisfactory because of the reasons discussed in Section II. The regression models are consistently better than the FIRST_energy-only models, though for *1eia* the difference is small.

In all our regression trees the root split was done with predictor Dist_H_A (the distance between the hydrogen and acceptor atoms), which therefore appear as the single most discriminative attribute to predict H-bond stability. This observation is consistent with previous findings. Levitt [6] found that most stable H-bonds have Dist_H_A less than 2.07Å. Jeffrey and Saenger [14] also suggested that Dist_H_A is a key attribute affecting H-bond stability, with a value less than 2.2Å for moderate to strong H-bonds. Consistent with these previous findings, the split values of the deepest Dist_H_A nodes in all our regression trees are around 2.1Å. This distance was observed in [6] to sometimes fluctuate by up to 3Å in stable H-bonds, due to high-frequency atomic vibration. This observation supports our decision to average predictor values over windows of *l*’ ticks.

Predictor FIRST_energy is often used in splits close to the root. This is not surprising since it is a function of several other pertinent predictors: Dist_H_A, Angle_D_H_A (the angle between the donor, the hydrogen atom, and the acceptor), Angle_H_A_AA (the angle between the hydrogen atom, the acceptor, and the atom covalently-bonded to the acceptor), and the hybridization state of the bond. Some other distance-based predictors (Dist_D_AA, Dist_D_A, Dist_H_D), angle-based predictors and Ch_type (describing whether the donor and acceptor are from main-chain or side-chain) predictor appear often in regression trees, but closer to the leaf nodes. They nevertheless play a significant role in predicting H-bond stability. For example, as shown in Figure Figure1,1, if Angle_H_A_AA is at least 105Â°, the stability is very high (about 0.96); otherwise, it drops to 0.71. The preference for larger angle matches well with the well-known linearity of H-bonds [14].

In order to get a more quantitative measure of the relative impact of the predictors on H-bond stability, we define the *importance* of a predictor *p* in a regression tree by: , where *N _{p}* is the set of nodes where the split is made using

Overall, we observe that predictors that describe the local environment of an H-bond play a relatively small role in predicting its stability. In particular, we had expected that descriptors such Num_hb_spaceNbr and Num_hb_spaceRgdNbr, which count the number of other H-bonds located in the neighborhood of the analyzed H-bond, would have had more importance. However, this may reflect the fact that the MD simulation trajectories used in our tests are too short to contain enough information to infer the role of such predictors. Indeed, while transitions between meta-stable states are rare in those trajectories, predictors describing local environments may have greater influence on the stability of H-bonds that must break for such transitions to happen. So, longer trajectories may eventually be needed to better model H-bond stability.

We have described machine learning methods to train protein-independent regression trees modeling H-bond stability in proteins. Test results demonstrate that trained models can predict H-bond stability quite well. In particular, their performance is significantly better (roughly 20% better) than that of a model based on H-bond energy alone. They can accurately identify a large fraction of the least stable H-bonds in a given conformation. However, our results also suggest that better results could be obtained with a richer set of MD simulation trajectories. In particular, the trajectories used in our experiments might be too short to characterize the stability of H-bonds that break and form during a transition between meta-stable states.

We believe that the training methods could be improved in several ways:

- It would be better to averaging predictor values before sub-sampling MD simulation trajectories. This would reduce the risk of filtering out changes in predictor values that are important for H-bond stability. Unfortunately, in our trajectories we only had access to the data after sub-sampling.

- More sophisticated learning techniques could be used. For example, instead of generating a single tree, we could generate an ensemble of trees, such as Gradient Boosting Trees [16].

- Finally, the notion of stability itself could be refined, for example by distinguishing between the case where an H-bond frequently switches on and off during a prediction window and the case where it rarely switches.

All four authors, IC, PY, MM, and JCL, participated in the formulation of the problem, the design of its solution, and the analysis and the interpretation of the results. PY prepared the experimental data. IC adapted a previously available CART software package and ran the experiments. All authors contributed to the writing of the manuscript, read and approved the final manuscript.

The authors declare that they have no competing interests.

This work was supported in part by a grant from the KAUST-Stanford Academic Excellence Alliance program. The authors thank L. Kavraki (Rice Univ.), V. Pande (Stanford), M. Levitt (Stanford), and J. Wang Tsai (Univ. of the Pacific) for providing us MD simulation trajectories and for useful comments during our work.

This article has been published as part of *BMC Bioinformatics* Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.

- Baker EN. Hydrogen bonding in biological macromolecules. International Tables for Crystallography. 2006;F:546–552. Chapter 22.2. full_text.
- Fersht AR, Serrano L. Principles in protein stability derived from protein engineering experiments. Curr. Opin. Struct. Biol. 1993;3:75–83. doi: 10.1016/0959-440X(93)90205-Y. [Cross Ref]
- Schell D, Tsai J, Scholtz JM, Pace CN. Hydrogen bonding increases packing density in the protein interior. Proteins. 2006;63:278–282. doi: 10.1002/prot.20826. [PubMed] [Cross Ref]
- Bikadi Z, Demko L, Hazai E. Functional and structural characterization of a protein based on analysis of its hydrogen bonding network by hydrogen bonding plot. Arch Biochem Biophys. 2007;461:225–234. doi: 10.1016/j.abb.2007.02.020. [PubMed] [Cross Ref]
- Dahiyat BI, Gordon DB, Mayo SL. Automated design of the surface positions of protein helices. Protein Science. 1997;6:1333–1337. doi: 10.1002/pro.5560060622. [PubMed] [Cross Ref]
- Levitt M. Molecular dynamics of hydrogen bonds in bovine pancreatic trypsin unhibitor protein. Nature. 1981;294:379–380. doi: 10.1038/294379a0. [PubMed] [Cross Ref]
- Thorpe MF, Lei M, Rader AJ, Jacobs DJ, Kuhn LA. Protein flexibility and dynamics using constraint theory. J. Mol. Graph. Model. 2001;19:60–69. doi: 10.1016/S1093-3263(00)00122-4. [PubMed] [Cross Ref]
- McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 1994;238:777–793. doi: 10.1006/jmbi.1994.1334. [PubMed] [Cross Ref]
- Breiman L, Friedman JH, Ilshen RA, Stone CJ. Classification and regression trees. CRC Press; 1984.
- Joo H, Qu X, Swanson R, McCallum CM, Tsai J. Modeling the dependency of residue packing upon backbone conformation using molecular dynamics simulation. Comput. Biol. Chem. 2010. [PMC free article] [PubMed]
- Haspel N, Ricklin D, Geisbrecht B, Lambris JD, Lydia EK. Electrostatic contributions drive the interaction between staphylococcus aureus protein Efb-C and its complement target C3d. Protein Sci. 2008;17:1894–1906. doi: 10.1110/ps.036624.108. [PubMed] [Cross Ref]
- Levitt M, Hirshberg M, Sharon R, Daggett V. Potential Energy Function and Parameters for simulations of the Molecular Dynamics of Proteins and Nucleic Acids in Solution. Computer Physics Communications. 1995;91:215–231. doi: 10.1016/0010-4655(95)00049-L. [Cross Ref]
- Srinivasan J, Trevathan M, Beroza P, Case D. Application of a pairwise generalized Born model to proteins and nucleic acids: inclusion of salt effects. Theor. Chem. Acc. 1999;101:426–434.
- Jeffrey GA, Saenger W. Springer-Verlag. Germany; 1991. Hydrogen bonding in biological structures.
- Tuv E, Borisov A, Torkokola K. Best subset feature selection for massive mixed-type problems. IDEAL. 2006;4224:1048–1056.
- Friedman JH. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics. 2000;29:1189–1232. doi: 10.1214/aos/1013203451. [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |