In all our regression trees, the root split was made on the predictor Dist_H_A (the distance between the hydrogen and acceptor atoms), which therefore appears to be the single most discriminative attribute for predicting H-bond stability. This observation is consistent with previous findings. Levitt [6] found that most stable H-bonds have Dist_H_A less than 2.07Å. Jeffrey and Saenger [14] also suggested that Dist_H_A is a key attribute affecting H-bond stability, with a value less than 2.2Å for moderate to strong H-bonds. In agreement with these values, the split values of the deepest Dist_H_A nodes in all our regression trees are around 2.1Å. This distance was observed in [6] to sometimes fluctuate by up to 3Å in stable H-bonds, due to high-frequency atomic vibration. This observation supports our decision to average predictor values over windows of l
Predictor FIRST_energy is often used in splits close to the root. This is not surprising, since it is a function of several other pertinent predictors: Dist_H_A, Angle_D_H_A (the angle between the donor, the hydrogen atom, and the acceptor), Angle_H_A_AA (the angle between the hydrogen atom, the acceptor, and the atom covalently bonded to the acceptor), and the hybridization state of the bond. Some other distance-based predictors (Dist_D_AA, Dist_D_A, Dist_H_D), angle-based predictors, and the Ch_type predictor (describing whether the donor and acceptor are from the main chain or a side chain) appear often in regression trees, but closer to the leaf nodes. They nevertheless play a significant role in predicting H-bond stability. For example, as shown in Figure , if Angle_H_A_AA is at least 105°, the stability is very high (about 0.96); otherwise, it drops to 0.71. This preference for larger angles matches the well-known linearity of H-bonds [14].
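To make the reading of such a split concrete, the following minimal sketch evaluates a single threshold split on Angle_H_A_AA. It assumes, as in a standard regression tree, that the value predicted on each side of a split is the mean stability of the occurrences falling on that side. The function name `split_means`, the row layout, and the four toy rows are hypothetical and are not taken from our data tables, so the resulting means are not meant to reproduce the 0.96 and 0.71 values quoted above.

```python
from statistics import mean


def split_means(rows, predictor, threshold):
    """Return (mean stability where predictor >= threshold, mean stability otherwise).

    Assumes both sides of the split are non-empty.
    """
    at_or_above = [r["stability"] for r in rows if r[predictor] >= threshold]
    below = [r["stability"] for r in rows if r[predictor] < threshold]
    return mean(at_or_above), mean(below)


# Hypothetical toy occurrences (illustrative values only).
rows = [
    {"Angle_H_A_AA": 120.0, "stability": 1.0},
    {"Angle_H_A_AA": 110.0, "stability": 0.9},
    {"Angle_H_A_AA": 100.0, "stability": 0.8},
    {"Angle_H_A_AA": 90.0, "stability": 0.6},
]

# Split at 105 degrees, the threshold discussed in the text.
high_mean, low_mean = split_means(rows, "Angle_H_A_AA", 105.0)
```

On this toy input, the occurrences with the larger angle have the higher mean stability, which is the qualitative pattern described in the text.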
In order to get a more quantitative measure of the relative impact of the predictors on H-bond stability, we define the importance $I_p$ of a predictor $p$ in a regression tree by:

$$I_p = \sum_{s \in N_p} I(s)\, n(s),$$

where $N_p$ is the set of nodes where the split is made using $p$, $I(s)$ is the score of the split $s$, and $n(s)$ is the number of H-bond occurrences falling into the node where split $s$ is made. We trained 10 models on data tables combining 10% of each of the 6 data tables. Importance scores for each predictor were averaged over these models and then linearly scaled so that the least important predictor (with non-zero average importance) has a score of 1. The average importance of every predictor appearing in at least one model is shown in Figure . The figure confirms that distance-based and angle-based predictors, as well as FIRST_energy, are the most important. It also shows that a number of other predictors, including Resi_name_H, Resi_name_A, and Range (the difference in residue numbers of donor and acceptor), have less, but still significant, importance.
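The importance computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: it assumes the importance of a predictor in one tree is the sum, over the nodes that split on it, of the split score multiplied by the number of H-bond occurrences reaching that node; that a predictor absent from a tree contributes zero to the average; and that each tree is supplied as a list of hypothetical `(predictor, score, n_occurrences)` tuples. The function names and the toy numbers are likewise hypothetical.

```python
from collections import defaultdict


def tree_importance(splits):
    """Sum score * n_occurrences over all splits made on each predictor."""
    importance = defaultdict(float)
    for predictor, score, n_occurrences in splits:
        importance[predictor] += score * n_occurrences
    return dict(importance)


def average_importance(trees):
    """Average per-tree importance over all models (absent predictor counts as 0)."""
    per_tree = [tree_importance(splits) for splits in trees]
    predictors = sorted({p for imp in per_tree for p in imp})
    return {
        p: sum(imp.get(p, 0.0) for imp in per_tree) / len(per_tree)
        for p in predictors
    }


def scale_to_min_one(avg):
    """Linearly scale so the smallest non-zero average importance equals 1."""
    nonzero = [v for v in avg.values() if v > 0]
    if not nonzero:
        return dict(avg)
    smallest = min(nonzero)
    return {p: v / smallest for p, v in avg.items()}


# Two hypothetical trees; scores and counts are illustrative only.
trees = [
    [("Dist_H_A", 0.5, 1000), ("FIRST_energy", 0.25, 400), ("Range", 0.125, 80)],
    [("Dist_H_A", 0.5, 800), ("Angle_H_A_AA", 0.25, 200)],
]

scaled = scale_to_min_one(average_importance(trees))
# scaled == {"Angle_H_A_AA": 5.0, "Dist_H_A": 90.0, "FIRST_energy": 10.0, "Range": 1.0}
```

Weighting each split score by the node's occurrence count means that a split near the root, which sees most of the data, contributes more than an equally scored split near a leaf.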
Overall, we observe that predictors that describe the local environment of an H-bond play a relatively small role in predicting its stability. In particular, we had expected that descriptors such as Num_hb_spaceNbr and Num_hb_spaceRgdNbr, which count the number of other H-bonds located in the neighborhood of the analyzed H-bond, would have greater importance. However, this may reflect the fact that the MD simulation trajectories used in our tests are too short to contain enough information to infer the role of such predictors. Indeed, transitions between meta-stable states are rare in those trajectories, whereas predictors describing local environments may have greater influence on the stability of H-bonds that must break for such transitions to happen. Longer trajectories may therefore eventually be needed to better model H-bond stability.