NMR data were analyzed using “RNAShifts”, a program designed to download and analyze RNA ¹H NMR chemical shifts that have been deposited in the BMRB. (Locally derived shifts that have yet to be deposited can also be analyzed.) All 131 depositions available in the BMRB were used in the current analysis except BMRB IDs 5170, 6814, 4816, 15697, 15915, 5023, 4253, 4894, and 15257, which could not be reliably used either because the BMRB assignments did not match the published PDB assignments or because there was no associated publication or PDB file that could be used to identify RNA secondary structure. As additional input, files were manually generated for each deposition, based on published structural studies, that identify for each residue (1) whether or not the residue is base-paired, (2) the nature of the base-pairing partner, (3) any long-range intra- and/or inter-molecular interactions (e.g., sites of protein binding or participation in A-minor or other RNA–RNA contacts), and (4) participation in structured (e.g., GNRA; G/g = guanosine, N/n = any nucleotide; R/r = purine; A/a = adenosine) or unstructured loops. A representative input file is shown in Supplementary Table S1.
The analysis focused on shifts reported for the non-exchangeable H8, H2, H6, H5, H1′, H2′ and H3′ protons of the central base pair of three consecutive canonical Watson–Crick base pairs (WC-BPs), here called WC-BP triplets: [5′−n(i−1)−Ni−n(i+1)]:[5′−n(j−1)−nj−n(j+1)], where Ni is the nucleotide for which the NMR shifts are being evaluated and n denotes a neighboring nucleotide (Fig. 1a). As additional parameters, we denoted whether the n(i−1):n(j+1) or n(i+1):n(j−1) base pairs were at terminating positions in the RNA, and we identified the secondary structural elements adjacent to the WC-BP triplets (canonical or non-canonical WC-BP, bulges, loops, long-range RNA–RNA interactions, and RNA–protein/ligand interactions) (Table S1).
Fig. 1a Definitions used for base pair triplets. The chemical shifts of the N(i) residue are analyzed in this work, and this strand may be preceded by a base-paired (WC or GU wobble) nucleotide (pre_n) or a non-base paired residue (5loop), or followed by a …
We chose a relatively conservative approach in modeling the effect of the neighborhood of each central base pair. This was done because there are still, especially in comparison to proteins, relatively few chemical shift assignment sets for RNA deposited at the BMRB. Rather than using any non-linear or neural network approach, we used an approach similar to the chemical shift increment method of Pretsch, as used in predicting spectra of small organic molecules (Pretsch et al. 2009). Thus, for the central residue of each WC-BP triplet, we defined the attributes describing the neighborhood of the central nucleotide as described above, and calculated the contribution that each attribute makes to the predicted chemical shift. The predicted chemical shift is then a base chemical shift plus the linear contribution of the value corresponding to each attribute present in that nucleotide’s environment. The contribution of each attribute was calculated by linear regression of the chemical shifts in our database of RNA chemical shifts with the set of explanatory variables represented by the neighborhood attributes. The constant term of our regression model corresponds to a nucleotide embedded in a triplet of Watson–Crick base pairs with a U (uridine) flanking it on both the 5′ and 3′ sides and Watson–Crick base-paired nucleotides at the 5′ and 3′ ends of the triplet.
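The increment scheme described above can be illustrated with a minimal sketch. It assumes a hypothetical increment table for a single atom type: the attribute names `pre_G` and `post_C` and all numerical values are invented for illustration and are not the fitted corrections from this work (`5loop` is borrowed from the Fig. 1 caption, but its value is likewise invented).

```python
# Hypothetical values in ppm; NOT the fitted corrections from this study.
BASE_SHIFT = 7.60  # constant term: the reference environment (U on both sides)
INCREMENTS = {
    "pre_G": -0.12,   # assumed name: G precedes the central nucleotide
    "post_C": 0.05,   # assumed name: C follows the central nucleotide
    "5loop": 0.20,    # unpaired residue on the 5' side
}


def predict_shift(attributes, base=BASE_SHIFT, increments=INCREMENTS):
    """Return the base shift plus the increment of every attribute present."""
    return base + sum(increments.get(name, 0.0) for name in attributes)


# A central nucleotide preceded by G and followed by C.
shift = predict_shift({"pre_G", "post_C"})
print(f"{shift:.2f} ppm")
```

An environment with no listed attributes returns the constant term alone, which mirrors the role of the reference triplet in the regression model.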
Our analysis included a total of 15 potential variables (listed in the table of sequence variables and chemical shift corrections), of which only some might contribute significantly to the shift of a specific atom in a given central nucleotide. Because the approach includes a large number of independent variables relative to the size of the chemical shift datasets, there was a significant danger of over-fitting with a conventional linear regression algorithm. Over-fitting can lead to excellent prediction of the training set but poor predictive capability on novel datasets. To minimize this risk we chose an algorithm, Pace Regression (Projection Adjustment by Contribution Estimation), that is capable of assessing the importance of each of the parameters. Calculations were performed using the Weka Machine Learning and Data Mining Library system, which allowed us to perform a statistical analysis of the prediction model (Witten et al. 2011). Pace Regression is a linear regression system that uses various information criteria to assess the degree of importance of the regression variables (Wang and Witten 2002). It thus provides one solution to the subset selection problem, that is, deciding which subset of a set of potential regressors is appropriate to explain the data, and thereby minimizes the risk of over-fitting and maximizes the predictive capability on previously unseen data.
Sequence variables and chemical shift corrections calculated by Pace regression
Use of Weka provided not only access to Pace Regression, but also various assessments of the quality of the predictions. In particular, we used 10-fold stratified cross-validation during our analysis. Rather than reporting correlation coefficients and root mean squared (rms) deviations for a model trained and evaluated on all of the data, this technique trains the model on 90% of the data and then assesses the results of predicting the remaining 10%. The process is repeated 10 times, holding out a different subset of the data each time, and the correlation coefficients and rms deviations are derived from the whole process. Pace Regression was used independently on each atom type present in each of the four central nucleotides, for a total of 19 regression calculations.
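The cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the Weka implementation: a one-variable ordinary least-squares fit stands in for Pace Regression, the folds are plain shuffled folds rather than stratified ones, and the function names and demonstration data are invented.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for Pace Regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def cross_validate(xs, ys, k=10, seed=0):
    """Return (correlation, rmsd) pooled over the k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets of the data
    obs, pred = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:  # predict only data the model has not seen
            obs.append(ys[i])
            pred.append(a + b * xs[i])
    n = len(obs)
    rmsd = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    r = cov / math.sqrt(var_o * var_p) if var_o and var_p else 0.0
    return r, rmsd


# Invented data: one binary attribute that shifts the resonance by 0.2 ppm.
demo_x = [i % 2 for i in range(40)]
demo_y = [7.5 + 0.2 * x for x in demo_x]
demo_r, demo_rmsd = cross_validate(demo_x, demo_y)
```

Because every prediction is made by a model that never saw the corresponding data point, the pooled statistics estimate performance on unseen data rather than goodness of fit to the training set.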
We were unable in our analysis to adequately identify and control for sample conditions (pH, temperature, ionic strength, etc.) and unusual molecular conformation, and there is a significant possibility of misassignment, especially of some atom types. Therefore, after dropping a single obvious major outlier, we minimized these effects by automatically trimming outliers and automatically adjusting the reference for the chemical shift sets. Automated outlier elimination was performed by running two passes of the Pace Regression for each atom/central nucleotide. In the first pass, the rms deviations between the experimental and predicted values were calculated using all of the data. Any data values that deviated from the predicted values by more than three times the rms deviation value were dropped, and a second pass of the Pace Regression was performed on the now trimmed dataset. Automatic re-referencing was achieved by performing the above analysis (including outlier detection) twice. In the first of these passes, the mean error of prediction was calculated for all the shifts from each BMRB file. Prior to the second pass, each shift was corrected by the mean deviation calculated for the corresponding BMRB file. The chemical shift corrections determined by this approach are listed in Table S3.
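The two-pass trimming and the per-file re-referencing can be sketched as below. This is a simplified illustration, not the RNAShifts code: the record layout (`bmrb_id`, `shift`), the function names, and the sign convention (error = observed minus predicted) are assumptions, and a mean-value predictor stands in for Pace Regression.

```python
import math
from collections import defaultdict


def fit_mean(records):
    """Stand-in model: predict the mean shift (the study used Pace Regression)."""
    mean = sum(r["shift"] for r in records) / len(records)
    return lambda record: mean


def trim_outliers(records, fit, n_sigma=3.0):
    """Pass 1: fit all data and find the rms deviation. Pass 2: refit without outliers."""
    predict = fit(records)
    errors = [r["shift"] - predict(r) for r in records]
    rmsd = math.sqrt(sum(e * e for e in errors) / len(errors))
    kept = [r for r, e in zip(records, errors) if abs(e) <= n_sigma * rmsd]
    return kept, fit(kept)


def rereference(records, predict):
    """Correct each shift by the mean prediction error of its BMRB file."""
    errors_by_file = defaultdict(list)
    for r in records:
        errors_by_file[r["bmrb_id"]].append(r["shift"] - predict(r))
    offsets = {k: sum(v) / len(v) for k, v in errors_by_file.items()}
    corrected = [dict(r, shift=r["shift"] - offsets[r["bmrb_id"]]) for r in records]
    return corrected, offsets


# Invented data: twenty consistent shifts and one gross outlier.
demo = [{"bmrb_id": "file_A", "shift": 7.5} for _ in range(20)]
demo.append({"bmrb_id": "file_A", "shift": 9.5})
kept, model = trim_outliers(demo, fit_mean)
```

In the full procedure, `rereference` would be applied to the output of a first trimmed fit, and the trimmed fit would then be repeated on the corrected shifts.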
The RNAShifts program was written using JTcl (http://jtcl.kenai.com) and Swank (http://swank.kenai.com), which are the Java implementations of the Tcl programming language and Tk graphical user interface toolkit (Ousterhout and Jones 2010). The analysis mode is run in three stages. The first loads BMRB files (fetching them from http://bmrb.wisc.edu if necessary), extracts chemical shifts, and then uses the input template to assign attributes to each shift. The second stage reads the output of the first stage and generates input files in the format used by Weka. The third executes Weka multiple times for each proton type, manages the two passes used for outlier detection, and generates various statistical output files. The graphical interface module allows plotting predicted and experimental data subject to various criteria for choosing subsets of the data and attributes for plotting. The RNAShifts program is available upon request from the author (BAJ).
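As an illustration of the second stage, the sketch below writes a table of attributes and shifts in ARFF, the text format that Weka reads. It assumes that ARFF is the format in question; the relation name, the attribute names (`pre_n`, `post_n`), and the data row are hypothetical and do not reproduce the files that RNAShifts generates.

```python
def to_arff(relation, nominal_attrs, rows, target="shift"):
    """Format rows as ARFF text: nominal attributes first, numeric target last."""
    lines = [f"@relation {relation}", ""]
    for name, values in nominal_attrs.items():
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines.append(f"@attribute {target} numeric")
    lines += ["", "@data"]
    for row in rows:
        cells = [row[name] for name in nominal_attrs] + [f"{row[target]:.3f}"]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical attributes: identity of the nucleotides flanking the central one.
attrs = {"pre_n": ["A", "C", "G", "U"], "post_n": ["A", "C", "G", "U"]}
rows = [{"pre_n": "G", "post_n": "C", "shift": 7.53}]
arff_text = to_arff("G_H8", attrs, rows)
print(arff_text)
```

Writing one such file per atom type and central nucleotide would match the one-regression-per-atom-type design described in the text.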