Our work was motivated in large part by the observation that the differences in accuracy achieved by RNA secondary structure prediction methods are often quite small on average, but vary considerably between individual RNAs [5, 6]. While this is not surprising, it suggests that care should be taken when assessing different prediction methods to ensure statistically meaningful results, and that benefits could potentially be derived from combining predictions obtained from different methods. The statistical procedures we use in this work make it possible to assess statistical significance in a well-established, quantitative and yet computationally affordable way, and our *AveRNA* procedure provides a practical way of realising the benefits inherent in a set of complementary prediction methods.

Our results demonstrate that there has, indeed, been steady progress in the prediction accuracy obtained from energy-based RNA secondary structure prediction methods. The fact that *CONTRAfold 1.1* provides no statistically significant improvement in accuracy over the standard *T99* energy model when both are evaluated on our large and diverse set of reference structures needs to be viewed in light of the fact that *CONTRAfold 1.1* was trained on a limited set of RNA structures from the Rfam database. The fact that *CONTRAfold 2.0*, which was trained on the same larger and richer set used by Andronescu *et al.* [4], performs much better further highlights the importance of the training set used as a basis for empirically optimising the performance of prediction methods. It is interesting to observe that the performance difference between *CONTRAfold 2.0* and *NOM-CG*, which are trained on the same set of reference structures, is not statistically significant, which indicates that both methods are equally effective in making use of the information inherent in this set. However, *NOM-CG*, thanks to its additional use of thermodynamic data, produces a physically plausible energy model, while the probabilistic model underlying *CONTRAfold 2.0* does not produce realistic free energy values.

We further interpret the fact that *DIM-CG*, *CG*^{}, *BL*^{} and *BL-FR*^{} all perform significantly better than *CONTRAfold 2.0* as evidence that the thermodynamic data used by the former methods can effectively inform methods for optimising prediction accuracy based on data. Our statistical analysis provides further support for the claim that the computationally more expensive Boltzmann Likelihood parameter estimation method leads to better results than the Constraint Generation method, and that the additional use of probabilistic feature relationships enables further significant improvements [5].

The accuracy results we obtained for the *MaxExpect* procedure [6] and for *Centroidfold* [7] are markedly lower than those reported in the respective original studies, mainly because our evaluation is based on a more extensive set of reference structures. However, we note that the underlying approaches of maximising expected base-pair accuracy and of *γ*-centroid estimators can in principle be applied to any prediction method that produces probability distributions over the secondary structures of a given sequence. We therefore expect that these ideas can eventually be used in combination with parameter estimation methods, such as the ones that gave rise to the *CG*^{}, *BL*^{} and *BL-FR*^{} parameter sets.

The results of our correlation analysis revealed that prediction methods whose accuracy over the entire benchmark set does not differ much (such as *T99* and *CONTRAfold 1.1*) show large differences in accuracy on many individual RNAs. Consistent with earlier observations that predictions that are slightly suboptimal according to a given energy model can sometimes be much more accurate (see, *e.g.*, [6]), we conjecture that this is a consequence of systematic weaknesses (such as the lack of accounting for interactions between non-neighbouring bases or the use of an overly simplistic energy model for multiloops) and inaccuracies (for example, in thermodynamic measurements) in the energy models underlying these procedures. Particularly when using automated methods for optimising the parameters of a given energy model, such weaknesses and inaccuracies could easily lead to multiple solutions that show similar performance on average, but give very different results on many individual RNAs.

This situation, while at first glance somewhat unsatisfactory, provides the basis for our *AveRNA* approach, which obtains more accurate predictions by means of a weighted combination of the predictions obtained from a set of given prediction procedures. While our study is focussed on the prediction of pseudoknot-free MFE structures, we note that the weighted sum calculation performed by *AveRNA* on base pairing matrices naturally extends to methods that produce base pairing probabilities and to pseudoknotted prediction methods. In the latter case, the calculation of the weighted probability matrix *P*(**w**) proceeds exactly as in the pseudoknot-free case, but the procedure used for structure inference would have to be modified to produce pseudoknotted MEA structures. In the former case, probability matrices are used instead of Boolean matrices, and the result of the calculation would be normalised to yield a well-formed base pairing probability matrix. (We note that, in light of very recent empirical results based on the statistical approach first developed in the context of the work presented here, it is not clear that MEA structures determined from individual base pairing probability matrices are generally more accurate than MFE structures for the same energy model [29]; however, it is possible that higher accuracies can be obtained via ensemble-based MEA predictions from weighted combinations of multiple base pairing matrices.) We pursued neither of these directions here, because the number of high-accuracy prediction procedures for pseudoknotted RNA structures or base-pair probabilities is currently more limited, and because the development and assessment of extensions of *AveRNA* to those cases pose challenges that are beyond the scope of this work. Nevertheless, we strongly believe that these directions are very promising and should be explored in the future.
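To make the weighted sum over base pairing matrices concrete, the following is a minimal sketch rather than the *AveRNA* implementation. It assumes that each prediction is given as a list of 0-based base pairs, that weights are non-negative, and that they are rescaled to sum to one; the function names `pairs_to_matrix` and `weighted_pair_matrix` are hypothetical, and the structure inference step that turns the combined matrix into a single structure is not shown.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def pairs_to_matrix(pairs: Iterable[Tuple[int, int]], n: int) -> Matrix:
    """Build a symmetric 0/1 (Boolean) base pairing matrix for a sequence of length n."""
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        matrix[i][j] = 1.0
        matrix[j][i] = 1.0
    return matrix


def weighted_pair_matrix(matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Combine base pairing matrices into one matrix by a normalised weighted sum.

    Assumption for this sketch: weights are non-negative and are rescaled
    to sum to one, so every entry of the result lies in [0, 1].
    """
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per matrix")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = sum(weights)
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        share = weight / total
        for i in range(n):
            for j in range(n):
                combined[i][j] += share * matrix[i][j]
    return combined


# Three hypothetical predictions for a sequence of length 6.
predictions = [
    pairs_to_matrix([(0, 5), (1, 4)], 6),
    pairs_to_matrix([(0, 5), (1, 4)], 6),
    pairs_to_matrix([(0, 5), (2, 3)], 6),
]
P = weighted_pair_matrix(predictions, [0.5, 0.3, 0.2])
```

In this toy example, the pair predicted by all three procedures receives the full weight, while pairs predicted by only some procedures receive the summed weight of those procedures. The same function accepts probability matrices in place of Boolean ones, since it only relies on entries lying in [0, 1].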

We note, however, that *AveRNA* as presented here already realises an advantage usually found only in approaches that produce base pairing probabilities: an easy and intuitive way of assessing the confidence with which certain bases are predicted to pair or remain unpaired, by inspecting the entries of the probability matrix *P*(**w**). Values close to one indicate base pairs that are predicted consistently by many of the underlying prediction procedures, and values close to zero indicate bases that are consistently predicted to be unpaired. Intermediate values indicate base pairings for which there is more disagreement between the given prediction procedures. From the fact that, by thresholding these values, the sensitivity and specificity (PPV) for predicting base pairs can be increased quite substantially (as seen in Figure ), we conclude that the set of prediction procedures used by *AveRNA* in this work is sufficiently diverse to allow for this interpretation. The threshold parameter *θ* controls the trade-off between sensitivity and PPV in an intuitive way. It is conceivable that even higher sensitivity and PPV values can be obtained by optimising the weight parameters of *AveRNA* specifically for that purpose (something we did not attempt in this work).
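The effect of the threshold can be illustrated with a small sketch. It assumes a simple rule in which every entry of the matrix at or above *θ* is reported as a base pair, and it uses a made-up matrix and reference structure; the helper names `threshold_pairs` and `sensitivity_ppv` are hypothetical, and the structure inference procedure actually used by *AveRNA* is not reproduced here (plain thresholding does not, by itself, prevent a base from appearing in more than one pair).

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]


def threshold_pairs(p: List[List[float]], theta: float) -> Set[Pair]:
    """Return all pairs (i, j) with i < j whose matrix entry is at least theta."""
    n = len(p)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if p[i][j] >= theta}


def sensitivity_ppv(predicted: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Sensitivity = TP / |reference pairs|; PPV = TP / |predicted pairs|."""
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Made-up combined matrix for a sequence of length 6 (illustrative values only).
n = 6
P = [[0.0] * n for _ in range(n)]
for (i, j), value in {(0, 5): 1.0, (1, 4): 0.8, (2, 3): 0.2}.items():
    P[i][j] = P[j][i] = value

# Made-up reference structure.
reference = {(0, 5), (1, 4)}

# Sensitivity and PPV at a low, medium and high threshold.
results: Dict[float, Tuple[float, float]] = {
    theta: sensitivity_ppv(threshold_pairs(P, theta), reference)
    for theta in (0.1, 0.5, 0.9)
}
```

In this toy case, the low threshold keeps a weakly supported pair that is absent from the reference, which lowers PPV, while the high threshold discards a correct pair, which lowers sensitivity.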