Many structure prediction methods employ similar energy models, evolve from the same underlying algorithm, or are extended from a previous version. It is possible that the prediction results of those related methods are correlated. In other words, two programs may make similar errors in structure prediction that would result in similar prediction accuracies on a number of sequences. The correlation was tested on a set of benchmark results published previously (
16). This benchmark consists of 400 tRNA sequences (
28) predicted by 12 methods, namely, Fold (
29), Dynalign (
30), Multilign (
16), FoldalignM (
31), mLocARNA (
32), MASTR (
33), Murlet (
34), RAF (
35), RNASampler (
36), RNAshapes (
37), StemLoc (
38) and RNAalifold (
39). Fold finds the lowest free energy structure for a single sequence. Dynalign finds the lowest free energy structure that is conserved for two unaligned sequences. RNAalifold folds multiple RNA sequences, prealigned with ClustalW2 in this benchmark. The remaining methods predict secondary structures of multiple sequences without requiring that the sequences be aligned ahead of time.
In the benchmark, 400 tRNA molecules were randomly selected and divided into groups of 5, 10 or 20 sequences. A single calculation was run over multiple sequences (except for Fold and Dynalign). For Dynalign, one sequence is chosen to be used with each of the other sequences in the group for structure prediction, and for this one sequence, the structure from last Dynalign calculation is used for scoring (
16). The structures predicted in a single group are dependent on each other because the algorithms predict consensus structures. Thus, the sensitivities and PPVs for each sequence from the same calculation were averaged and this average was used as a single data point in the following statistical analyses. The sample size for the 5-, 10- and 20-sequence group predictions of 400 tRNA sequences are 80 (=

400/5), 40 (=400/10) and 20 (=400/20), respectively. By treating a single calculation as a data point instead of a single sequence as a data point, the statistical analysis is conservative in determining statistical significance because some power is lost in the calculation. This is specific to statistical tests of methods that use multiple sequences to find a consensus structure.
Correlation analysis
The most widely used correlation analysis is the Pearson product–moment correlation. It, however, is a parametric statistic designed to test a linear relationship between two variables normally distributed along an interval. For RNA structure prediction, the sensitivities or PPVs appear to be distributed far from the normal distribution (
Supplementary Figure S1). Because of this, the Spearman rank correlation coefficients, which are more robust to outliers with extreme value, were computed (
23). It is a non-parametric measure calculated with the ranks rather than the values of the observations. shows the matrices of correlation coefficients between sensitivities or PPVs predicted by any two algorithms over tRNA.
Multilign and Dynalign are correlated because Multilign is based on multiple Dynalign calculations. The correlation coefficient decreases from over 0.84 to ~0.65 when the number of sequences in one Multilign calculation increases from 5 to 20. This decreased correlation is expected because increasing the number of sequences in the Multilign prediction weakens the influence of individual Dynalign calculations. This is more obvious in the scatter plots in . The horizontal distances of points from the diagonal line in the scatter plot of the 10 or 20 sequence predictions are longer than those of 5 sequence predictions, showing that the decreased correlation coefficients are at least partially due to the further prediction improvement on a few sequences by increasing the sequence number from 5 to 20.
Statistical significance
To test whether an algorithm has better performance than another, the means of their sensitivities and PPVs are traditionally compared. Comparing mean scores calculated from a sample of sequences, however, is usually not sufficient to infer the relationships between two statistical populations, i.e. the prediction accuracies of two folding algorithms on all possible RNA sequences.
In addition to reporting means, many studies additionally report standard deviations, but this is also not enough information to determine whether two methods have significantly different performance. To statistically evaluate the prediction accuracy of a method compared to another, a two-tailed t-test can be used to infer whether their accuracy means are significantly different. The test should be two-tailed because the better method is unknown ahead of time.
Incorrectly, Welch's t-test of two samples could be performed. Welch's t-test is similar to Student's t-test but is intended for use with two samples having possibly unequal variances. This test is performed on sensitivities and PPVs predicted by any two methods. The P-values are shown in A. The P-value represents the risk of Type I error of rejecting the null hypothesis, that is the probability of claiming the two methods predict with different accuracies when actually their accuracies are the same.
Welch's
t-test does not require equal variance, but like the typical (Student)
t-test, it assumes the population from which the sample is drawn is normally distributed and the two populations are independent. Although the distribution of calculated sensitivities or PPVs is unlikely to be normal, a
t-test is still justified by Central Limit theorem (CLT) with a large sample size (i.e.
n
≥

30). The CLT says that

when the sample size n is large, where

is the sample mean,
µ and
σ are the population mean and population standard deviation. Replacing
σ with the sample standard deviation
s gives the
t-distribution of (
n
−

1) degrees of freedom,

(
40). The assumption of independence, however, is clearly violated as explained above by the fact that the methods have correlated prediction accuracy (). This correlation results in a systematic overestimation of the
P-values by Welch's
t-test in comparison with the paired test.
Given the observed correlation in prediction accuracy, a paired two-sample t-test is the appropriate statistical test. It calculates the differences of each observation between two methods and tests whether the mean of these differences is significantly different from zero. In the presence of significant correlation, paired t-tests have greater statistical power than unpaired t-tests by canceling the correlation when the two groups being compared occur in natural pairs. As shown in , the P-values calculated from independent two-sample t-test (A) tend to be systematically larger than those from paired t-test (B). Specifically, the differences of PPVs and sensitivities between Multilign and Dynalign are significant when judged using paired t-tests and an α chosen to be 5%. The significances, however, are obscured by the incorrect application of independent t-tests (A).
Powering the benchmarks
For those comparisons with P-values larger than 0.05, the benchmarks fail to demonstrate that there is a significant difference in prediction accuracies between the two methods with a low Type I error. There is a subtle but important distinction between failing to demonstrate that there is difference and demonstrating there is no difference. The caveat is that the benchmark might lack the power to detect the difference due to the small sample size and large data variations, if the difference does exist. How big should the sample size be to detect small difference with controlled Type I error? This is non-trivial because benchmarks are costly in computer time, so choosing an optimally sized data set can save computation time when comparing two folding algorithms. On the other hand, if the null hypothesis cannot be rejected, Type II error, which is the probability of failing to reject a false null hypothesis, needs to be controlled. Conventionally the probability of Type II error is denoted by β and is chosen to be no larger than 0.2.
Based on that, a pipeline is proposed to statistically evaluate the performance of algorithms using pairwise comparisons (). During the benchmark, data are accumulated sequentially over time until the available computation time or sequences run out or the null hypothesis is proven or disproven. Thus a sequential paired two-sample
t-test can be employed to reach an early conclusion regarding the comparison of prediction accuracy. This sequential style of interim estimate, however, can inflate the Type I error probability (
41,
42). Here the critical value is adjusted by simulations to accommodate this inflation. If the null hypothesis cannot be rejected at any of the interim tests, the power is calculated to see whether the null hypothesis can be accepted with high confidence. If no conclusion can be made, the process proceeds to the next stage of testing with benchmarks on an additional group of sequences. This step can be iterated until a conclusion is reached, computer resources are depleted, or until no more sequences are available.
To illustrate the pipeline, the performances of the same 12 methods as above were evaluated on groups of five 5S rRNA sequences (
43). In the Multilign paper, the sample size for 5S rRNA is only 20 (

=

100 sequences/5 sequences per calculation). It is expanded with an additional 1095 sequences to reach the sample size of 239. Then the statistical analysis proceeded sequentially for up to 12 stages, with 20 more predictions done on an extra 100 sequences at each stage (except that the last one only has 95 sequences because all available sequences were used). At the beginning, a
t-statistic for the first 20 predictions is calculated and compared to the adjusted critical values in the same manner as the paired two-sample
t-test. If the null hypothesis is neither rejected nor accepted with defined Types I and II error probabilities, the
t-test is repeated with 20 additional calculations. Many pairwise comparisons are shown to already have significant difference in the first stage (). Take Multilign versus RNAalifold for example, Multilign is significantly better than RNAalifold for both sensitivity and PPV with the original 100 sequences. For some cases, the iterative sampling method does refine the test to be able to draw conclusions. For example, using the first 20 groups of sequences, the
a priori power analysis estimates that 142 groups of sequences are needed to resolve the hypothesis of comparing the sensitivity between Multilign and Dynalign. This is much more than the 80 groups eventually used in the sequential test, which demonstrates that Multilign does have a higher average PPV than Dynalign. There are, however, also many inconclusive comparisons. For example, the null hypothesis that MASTR is as accurate as RNAalifold cannot be rejected, and it cannot be accepted either, for the inability to reject the null hypothesis may be due to the small power (0.18 and 0.20 for sensitivity and PPV), not because they truly have the same performance. Unfortunately, the available 5S rRNA sequences run out before sufficient power can be achieved to make a conclusion.