|Home | About | Journals | Submit | Contact Us | Français|
With discovery of diverse roles for RNA, its centrality in cellular functions has become increasingly apparent. A number of algorithms have been developed to predict RNA secondary structure. Their performance has been benchmarked by comparing structure predictions to reference secondary structures. Generally, algorithms are compared against each other and one is selected as best without statistical testing to determine whether the improvement is significant. In this work, it is demonstrated that the prediction accuracies of methods correlate with each other over sets of sequences. One possible reason for this correlation is that many algorithms use the same underlying principles. A set of benchmarks published previously for programs that predict a structure common to three or more sequences is statistically analyzed as an example to show that it can be rigorously evaluated using paired two-sample t-tests. Finally, a pipeline of statistical analyses is proposed to guide the choice of data set size and performance assessment for benchmarks of structure prediction. The pipeline is applied using 5S rRNA sequences as an example.
There has been an explosion in our understanding of roles for RNA in cellular processes and gene expression in recent decades. RNA can form complex three dimensional structure either alone or with proteins to catalyze RNA splicing (1), catalyze peptide bond formation (2), guide protein localization (3), and tune gene regulation (4,5). Prediction of RNA secondary structure, the set of base pairing interactions between A-U, G-U and G-C, facilitates the development of hypotheses that connect structure to function. It also underlies a number of applications such as non-coding RNA detection (6–8), RNA tertiary structure prediction (9), siRNA design (10), miRNA target prediction (11) and structure design (12). There are many algorithms that have been developed to apply to specific situations with their own strengths and limitations.
The performance of a given structure prediction method is usually benchmarked on a set of RNA families by comparing predicted structures with the known secondary structures. Two statistical scores, sensitivity and positive predictive value (PPV), are commonly tabulated to determine the accuracy of the prediction methods (13). Sensitivity is the fraction of known pairs correctly predicted and PPV is the fraction of predicted pairs in the known structure. The two scores are sufficient to evaluate the accuracy. Another measure, the Matthews correlation coefficient (MCC) (14,15), is used as a single score that summarizes both sensitivity and PPV. It is defined as: MCC=, where TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative base pairs. It can be approximated by the geometric mean of sensitivity and PPV (15). Although MCC is a more succinct measure, by combining sensitivity and PPV it loses the information about capability and quality of a given method.
Traditionally, the averages of the statistical scores, either over families or sequences, are calculated to compare the accuracies of different folding algorithms. In this contribution, a statistical test is introduced to evaluate the performance of RNA folding methods against each other. Specifically, the prediction accuracies of some methods are shown to be correlated over sequences, and the paired two-sample t-test is proposed to compare accuracies of methods. Finally, precision and power analyses are suggested to determine the necessary dataset size for the benchmark to control the probabilities of hypothesis testing errors. An example is shown for programs that predict a consensus structure for multiple sequences, using all 5S rRNA sequences with reference structures available (16).
The prediction results of 12 RNA folding algorithms are taken from a previous study for this statistical analysis (16). That study reported a new algorithm for predicting an RNA secondary structure common to three or more sequences. When calculating sensitivity and PPV, a predicted base pair, i − j, counted as a correctly predicted pair if i−j, (i+1)−j, (i−1)−j, i−(j+1) or i−(j−1) pair is in the known structure (17). This is important because structures are compared to structures determined by comparative sequence analysis where there can be uncertainty in the exact pair match (18). Additionally, thermal noise can cause the actual structure to sample multiple pairs. For example, there is evidence in the thermodynamic studies of single RNA bulges that multiple conformational states exist (19,20).
All the following analyses were performed with the R statistical environment (21). Scripts are available for download from the Mathews lab website, http://rna.urmc.rochester.edu. All the figures were plotted with the Lattice package in R (22).
To evaluate the independence between prediction results by two RNA folding algorithms, the Spearman rank correlation coefficient was calculated. It is a Pearson correlation coefficient between the ranks of two variables. It is calculated using the equation: , where xi and yi are the ranks of average sensitivity or PPV predicted by the two algorithms on the ith group of sequences, and and are the means of the ranks. In case of tied observations, the arithmetic average of the rank numbers was used (23).
t-Tests were performed to calculate P-values in two ways. Assuming their prediction results were uncorrelated, an incorrect assumption, independent two-sample Welch’s t-tests were used to compare the sensitivities or PPVs predicted by two algorithms with the null hypothesis that the means of the two samples are the same. Alternatively, paired two-sample t-tests were performed. Essentially the differences of individual sensitivities or PPVs from two algorithms were calculated to reduce the problem to a one-sample t-test with null hypothesis that the mean of the differences is zero (24).
The probability of Type I error, denoted by α, is the probability of rejecting a true null hypothesis. The precision analysis relates the width of the confidence interval to the sample size by controlling α. A (1−α) percent confidence interval of a two-tailed paired t-test is defined as , where δ is the sample mean of the differences between sensitivities (or PPVs) of two algorithms, Sd is the sample standard deviation, n is the sample size and tα/2,(n−1) is the critical value for the t distribution with (n−1) degrees of freedom at the significance level of α. For an observed Sd, determining sample size n such that the confidence interval does not contain the point of zero, i.e. ≤δ, gives the estimated sample size to reject the null hypothesis with the probability of Type I error no greater than α (25).
The power of a statistical test is the probability of rejecting a false null hypothesis. It equals (1−β), where β refers to the probability of the Type II error, namely, the probability of failing to reject a false null hypothesis. The sample size required for a given power can be prospectively estimated as n=Vd (Z1−β+Z1−α/2)2/δ2, where δ, Vd are the hypothetical mean and variance of the differences between two groups of scores, α and β are the probabilities of Types I and II errors, and Z1−β and Z1−α/2 are Z statistics of a standard normal distribution at the (1−β)th and (1−α/2)th quantiles (25). Conversely, a posteriori β can be calculated after the statistical test from the equation to show the confidence to accept the null hypothesis with the given sample size and α. While precision analysis is often used to define the width of the confidence interval, a priori power analysis is used more to estimate a sample size to define how small a difference can be detected and with what degree of certainty.
Sequential testing is an alternative to fixed sample size testing (26,27). It is designed to reduce sample sizes without impacting power. Sample size is adaptively determined based on the accumulated data in a sequential test. It avoids the need for a single sample size estimate that would be larger than required for many cases. The methodology on the sequential procedure proposed in this article is in the Supplementary Material. Briefly, to prevent inflation of the Type I error rate introduced by sequential testing as opposed to testing with a fixed sample size, an adaptive rejection rule needs to be defined.
Many structure prediction methods employ similar energy models, evolve from the same underlying algorithm, or are extended from a previous version. It is possible that the prediction results of those related methods are correlated. In other words, two programs may make similar errors in structure prediction that would result in similar prediction accuracies on a number of sequences. The correlation was tested on a set of benchmark results published previously (16). This benchmark consists of 400 tRNA sequences (28) predicted by 12 methods, namely, Fold (29), Dynalign (30), Multilign (16), FoldalignM (31), mLocARNA (32), MASTR (33), Murlet (34), RAF (35), RNASampler (36), RNAshapes (37), StemLoc (38) and RNAalifold (39). Fold finds the lowest free energy structure for a single sequence. Dynalign finds the lowest free energy structure that is conserved for two unaligned sequences. RNAalifold folds multiple RNA sequences, prealigned with ClustalW2 in this benchmark. The remaining methods predict secondary structures of multiple sequences without requiring that the sequences be aligned ahead of time.
In the benchmark, 400 tRNA molecules were randomly selected and divided into groups of 5, 10 or 20 sequences. A single calculation was run over multiple sequences (except for Fold and Dynalign). For Dynalign, one sequence is chosen to be used with each of the other sequences in the group for structure prediction, and for this one sequence, the structure from last Dynalign calculation is used for scoring (16). The structures predicted in a single group are dependent on each other because the algorithms predict consensus structures. Thus, the sensitivities and PPVs for each sequence from the same calculation were averaged and this average was used as a single data point in the following statistical analyses. The sample size for the 5-, 10- and 20-sequence group predictions of 400 tRNA sequences are 80 (=400/5), 40 (=400/10) and 20 (=400/20), respectively. By treating a single calculation as a data point instead of a single sequence as a data point, the statistical analysis is conservative in determining statistical significance because some power is lost in the calculation. This is specific to statistical tests of methods that use multiple sequences to find a consensus structure.
The most widely used correlation analysis is the Pearson product–moment correlation. It, however, is a parametric statistic designed to test a linear relationship between two variables normally distributed along an interval. For RNA structure prediction, the sensitivities or PPVs appear to be distributed far from the normal distribution (Supplementary Figure S1). Because of this, the Spearman rank correlation coefficients, which are more robust to outliers with extreme value, were computed (23). It is a non-parametric measure calculated with the ranks rather than the values of the observations. Figure 1 shows the matrices of correlation coefficients between sensitivities or PPVs predicted by any two algorithms over tRNA.
Multilign and Dynalign are correlated because Multilign is based on multiple Dynalign calculations. The correlation coefficient decreases from over 0.84 to ~0.65 when the number of sequences in one Multilign calculation increases from 5 to 20. This decreased correlation is expected because increasing the number of sequences in the Multilign prediction weakens the influence of individual Dynalign calculations. This is more obvious in the scatter plots in Figure 2. The horizontal distances of points from the diagonal line in the scatter plot of the 10 or 20 sequence predictions are longer than those of 5 sequence predictions, showing that the decreased correlation coefficients are at least partially due to the further prediction improvement on a few sequences by increasing the sequence number from 5 to 20.
To test whether an algorithm has better performance than another, the means of their sensitivities and PPVs are traditionally compared. Comparing mean scores calculated from a sample of sequences, however, is usually not sufficient to infer the relationships between two statistical populations, i.e. the prediction accuracies of two folding algorithms on all possible RNA sequences.
In addition to reporting means, many studies additionally report standard deviations, but this is also not enough information to determine whether two methods have significantly different performance. To statistically evaluate the prediction accuracy of a method compared to another, a two-tailed t-test can be used to infer whether their accuracy means are significantly different. The test should be two-tailed because the better method is unknown ahead of time.
Incorrectly, Welch's t-test of two samples could be performed. Welch's t-test is similar to Student's t-test but is intended for use with two samples having possibly unequal variances. This test is performed on sensitivities and PPVs predicted by any two methods. The P-values are shown in Figure 3A. The P-value represents the risk of Type I error of rejecting the null hypothesis, that is the probability of claiming the two methods predict with different accuracies when actually their accuracies are the same.
Welch's t-test does not require equal variance, but like the typical (Student) t-test, it assumes the population from which the sample is drawn is normally distributed and the two populations are independent. Although the distribution of calculated sensitivities or PPVs is unlikely to be normal, a t-test is still justified by Central Limit theorem (CLT) with a large sample size (i.e. n≥30). The CLT says that when the sample size n is large, where is the sample mean, µ and σ are the population mean and population standard deviation. Replacing σ with the sample standard deviation s gives the t-distribution of (n−1) degrees of freedom, (40). The assumption of independence, however, is clearly violated as explained above by the fact that the methods have correlated prediction accuracy (Figure 1). This correlation results in a systematic overestimation of the P-values by Welch's t-test in comparison with the paired test.
Given the observed correlation in prediction accuracy, a paired two-sample t-test is the appropriate statistical test. It calculates the differences of each observation between two methods and tests whether the mean of these differences is significantly different from zero. In the presence of significant correlation, paired t-tests have greater statistical power than unpaired t-tests by canceling the correlation when the two groups being compared occur in natural pairs. As shown in Figure 3, the P-values calculated from independent two-sample t-test (A) tend to be systematically larger than those from paired t-test (B). Specifically, the differences of PPVs and sensitivities between Multilign and Dynalign are significant when judged using paired t-tests and an α chosen to be 5%. The significances, however, are obscured by the incorrect application of independent t-tests (Figure 3A).
For those comparisons with P-values larger than 0.05, the benchmarks fail to demonstrate that there is a significant difference in prediction accuracies between the two methods with a low Type I error. There is a subtle but important distinction between failing to demonstrate that there is difference and demonstrating there is no difference. The caveat is that the benchmark might lack the power to detect the difference due to the small sample size and large data variations, if the difference does exist. How big should the sample size be to detect small difference with controlled Type I error? This is non-trivial because benchmarks are costly in computer time, so choosing an optimally sized data set can save computation time when comparing two folding algorithms. On the other hand, if the null hypothesis cannot be rejected, Type II error, which is the probability of failing to reject a false null hypothesis, needs to be controlled. Conventionally the probability of Type II error is denoted by β and is chosen to be no larger than 0.2.
Based on that, a pipeline is proposed to statistically evaluate the performance of algorithms using pairwise comparisons (Figure 4). During the benchmark, data are accumulated sequentially over time until the available computation time or sequences run out or the null hypothesis is proven or disproven. Thus a sequential paired two-sample t-test can be employed to reach an early conclusion regarding the comparison of prediction accuracy. This sequential style of interim estimate, however, can inflate the Type I error probability (41,42). Here the critical value is adjusted by simulations to accommodate this inflation. If the null hypothesis cannot be rejected at any of the interim tests, the power is calculated to see whether the null hypothesis can be accepted with high confidence. If no conclusion can be made, the process proceeds to the next stage of testing with benchmarks on an additional group of sequences. This step can be iterated until a conclusion is reached, computer resources are depleted, or until no more sequences are available.
To illustrate the pipeline, the performances of the same 12 methods as above were evaluated on groups of five 5S rRNA sequences (43). In the Multilign paper, the sample size for 5S rRNA is only 20 (=100 sequences/5 sequences per calculation). It is expanded with an additional 1095 sequences to reach the sample size of 239. Then the statistical analysis proceeded sequentially for up to 12 stages, with 20 more predictions done on an extra 100 sequences at each stage (except that the last one only has 95 sequences because all available sequences were used). At the beginning, a t-statistic for the first 20 predictions is calculated and compared to the adjusted critical values in the same manner as the paired two-sample t-test. If the null hypothesis is neither rejected nor accepted with defined Types I and II error probabilities, the t-test is repeated with 20 additional calculations. Many pairwise comparisons are shown to already have significant difference in the first stage (Figure 5). Take Multilign versus RNAalifold for example, Multilign is significantly better than RNAalifold for both sensitivity and PPV with the original 100 sequences. For some cases, the iterative sampling method does refine the test to be able to draw conclusions. For example, using the first 20 groups of sequences, the a priori power analysis estimates that 142 groups of sequences are needed to resolve the hypothesis of comparing the sensitivity between Multilign and Dynalign. This is much more than the 80 groups eventually used in the sequential test, which demonstrates that Multilign does have a higher average PPV than Dynalign. There are, however, also many inconclusive comparisons. For example, the null hypothesis that MASTR is as accurate as RNAalifold cannot be rejected, and it cannot be accepted either, for the inability to reject the null hypothesis may be due to the small power (0.18 and 0.20 for sensitivity and PPV), not because they truly have the same performance. Unfortunately, the available 5S rRNA sequences run out before sufficient power can be achieved to make a conclusion.
Because of the natural variability between measurements and the limitations on sampling, one cannot conclude that a method is better than another merely on the basis of means, especially when the difference in means is small. In this article, a statistical analysis is introduced to test the significance of improved RNA secondary structure prediction accuracy. It shows that the performance of algorithms can be correlated and a paired two-sample t-test can be used to compare the performance of two algorithms in terms of sensitivity or PPV. Even if there is no relationship between two RNA folding methods, a paired t-test is still justified because both methods are tested on the same set of sequences and thus their prediction scores are matching pairs. Another caveat is that a correlation coefficient of zero does not necessarily mean the two sets are uncorrelated. An example is the parabola function y=x2, for which both the Spearman rank correlation coefficient and Pearson product–moment correlation coefficient are zero for a sample centered on x=0, although there is perfect quadratic relationship between x and y.
Finally, a pipeline of evaluation is provided as illustrated in Figure 4. After the development and implementation of a new algorithm, its performance can be compared to another algorithm with a sequential procedure illustrated in the flow chart. If no conclusion can be made in the ith stage, the procedure moves on to the next stage with more sequences added into the benchmark. The sequential test used here has the advantages of reducing the total number of sequences required for the statistical test. In reality, with the broad range of the scores, the small differences between the average performances, and limited number of sequences available, it is often the case that which of two methods is better is statistically unknown. Because, however, the performance difference is usually small, the two methods may perform similarly in a practical sense on real data whether they are statistically different or not.
Although all the analyses above were performed on sensitivity and PPV, they essentially apply to other scores, such as MCC or the alignment score (sum of pairs score) without any modifications.
The method presented here, in Figure 4, provides the means for assessing whether the structure prediction performance of a program is significantly better than another on a benchmark set of sequences. It does not, however, determine whether one program is actually better than another. One first reason is that the benchmark set of sequences needs to be well chosen to represent the actual way the programs will be applied. A second reason is that programs may perform better on some families of sequences than on others, so a statistically better performance on one family of sequences may not result in better performance on a different family of sequences.
Supplementary Data are available at NAR online: Supplementary Tables 1 and 2, Supplementary Figure 1, and Supplementary methods.
National Institutes of Health (grant R01HG004002 to D.H.M.). Funding for open access charge: NIH.
Conflict of interest statement. None declared.
The authors thank Holger Hoos, University of British Columbia, and Charles Lawrence, Brown University, for insightful comments.