We benchmark our method TreeFolder on 11 RNAs that were tested by both BARNACLE and FARNA. These RNAs contain 12 to 46 nucleotides and are not homologous to any structures in our training dataset. When an RNA has multiple NMR structures, we use the first structure in the PDB file as its native structure.
Comparing two methods by their lowest-RMSD decoys alone is not very reliable, since such decoys may arise by chance and their quality depends on the number of decoys generated: the more decoys are generated, the more likely the lowest-RMSD decoy is to lie close to the native structure. A better strategy is therefore to compare the RMSD distributions of the decoys.
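The dependence of the lowest RMSD on the number of decoys can be illustrated with a minimal sketch. The RMSD values here are synthetic draws from an arbitrary distribution (an assumption made only for illustration, not output of TreeFolder, FARNA or BARNACLE), and `summarize` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic RMSD values in Å (assumption: not produced by any real sampler).
pool = [abs(rng.gauss(8.0, 2.0)) for _ in range(10000)]


def summarize(rmsds):
    """Return (lowest RMSD, median RMSD) for a list of decoy RMSDs."""
    return min(rmsds), statistics.median(rmsds)


# Growing prefixes of the same pool mimic generating more and more decoys.
for n in (100, 1000, 10000):
    lowest, median = summarize(pool[:n])
    print(f"{n:>6} decoys: lowest = {lowest:.2f} Å, median = {median:.2f} Å")
```

Because each larger sample contains the smaller one, the lowest RMSD can only stay the same or decrease as more decoys are added, whereas a distribution summary such as the median is not pushed in one direction by sample size.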
Our TreeFolder generates better decoys than FARNA: we compare FARNA and TreeFolder in terms of the quality of the decoy cluster centroids. FARNA clusters only the top 1% of its decoys with the lowest energy; similarly, we run MaxCluster to group the top 1% of our decoys with the lowest energy into five clusters. As the comparison below shows, TreeFolder generates decoys with better cluster centroids for nine RNAs: 1a4d, 1esy, 1kka, 1q9a, 1xjr, 1zih, 28sp, 2a43 and 2f88. Moreover, even though we generate significantly fewer decoys, the lowest-RMSD decoys produced by TreeFolder for 1a4d, 1zih and 28sp still have smaller RMSD than those produced by FARNA.
Comparison between FARNA and our method TreeFolder
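The selection step that precedes clustering can be sketched as follows. This is only an illustration under assumptions: the `(decoy_id, energy)` pair layout and the function name `lowest_energy_fraction` are hypothetical, and the clustering itself (done with MaxCluster in our experiments) is not reproduced here.

```python
import math


def lowest_energy_fraction(decoys, fraction=0.01):
    """Return the given fraction of decoys with the lowest energy.

    decoys: list of (decoy_id, energy) pairs (hypothetical layout).
    At least one decoy is kept whenever the input is non-empty.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(len(decoys) * fraction))
    # Sort by energy (ascending) and keep the lowest-energy entries.
    return sorted(decoys, key=lambda d: d[1])[:keep]


# Made-up energies: 1000 decoys whose energies are a permutation of 0..999.
decoys = [(f"decoy_{i}", float((i * 37) % 1000)) for i in range(1000)]
selected = lowest_energy_fraction(decoys)
print(len(selected), "decoys selected for clustering")
```

The selected subset would then be handed to the clustering tool; only the centroids of the resulting clusters enter the comparison.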
Our TreeFolder generates better decoys than BARNACLE: the comparison below displays the 5% and 25% quantiles of the RMSD distributions for decoys generated by BARNACLE and TreeFolder. The quantiles for BARNACLE are taken from Supplementary Table S4 in Frellsen et al. (2009). BARNACLE considers only decoys with energy <1, since such decoys are likely to have more correct base pairings. We use exactly the same energy function as BARNACLE, so we also consider only decoys with energy <1 to ensure a fair comparison. We did not generate as many decoys as BARNACLE, and thus for some test RNAs we do not have many decoys with energy <1. In these cases, we use decoys with energy <2. On the 10 RNAs compared below, TreeFolder yields better RMSD distributions for eight of them: 1esy, 1kka, 1q9a, 1qwa, 1xjr, 1zih, 28sp, 2a43 and 2f88.
The 5% and 25% quantiles of the RMSD distributions for decoys generated by our method TreeFolder and BARNACLE
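The energy filtering and quantile computation described above can be sketched as below. Several details are assumptions made for illustration: the `(energy, rmsd)` pair layout, the linear-interpolation quantile, and in particular the `min_count` threshold that decides when the energy <1 set is too small, which is a hypothetical parameter since no such number is stated in the text.

```python
def quantile(values, q):
    """Empirical quantile (q in [0, 1]) with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def rmsd_quantiles(decoys, cutoffs=(1.0, 2.0), min_count=50, qs=(0.05, 0.25)):
    """Return (energy cutoff used, RMSD quantiles) for one RNA.

    decoys: list of (energy, rmsd) pairs (hypothetical layout).
    The cutoffs are tried in order; a looser cutoff is used only when the
    stricter one leaves fewer than min_count decoys (assumed threshold).
    """
    kept, used = [], None
    for cutoff in cutoffs:
        kept = [rmsd for energy, rmsd in decoys if energy < cutoff]
        used = cutoff
        if len(kept) >= min_count:
            break
    if not kept:
        raise ValueError("no decoys pass the energy cutoffs")
    return used, [quantile(kept, q) for q in qs]


# Made-up example: only 5 decoys have energy <1, so the <2 cutoff is used.
example = [(0.5, float(i)) for i in range(5)]
example += [(1.5, float(i)) for i in range(5, 105)]
cutoff, (q05, q25) = rmsd_quantiles(example)
print(cutoff, q05, q25)
```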
Sequence information is important for RNA conformation sampling: unlike the two other state-of-the-art methods, FARNA and BARNACLE, our TreeFolder makes use of sequence information to significantly improve conformation sampling, as measured by the median RMSD values of decoys. The comparison below covers two CRF models: one uses sequence information to sample conformations and the other does not. Without sequence information, our CRF method is similar to BARNACLE; that is, it models only angle state transitions in an RNA structure. Both CRF models use 50 conformation states. For the CRF model without sequence features, the regularization factor is set to 5 (i.e. λ=5), while for the CRF model utilizing sequence information, the regularization factors are set to 5 and 10 (i.e. λ=5, μ=10). To calculate the median RMSD, for each RNA we generate 300 decoys using the two CRF models.
Comparison between the CRF models using or without using sequence information
Sampling real-valued angles generates better decoys: to show the detailed difference between our TreeFolder and FARNA, we look into the decoys of 1esy. We choose this RNA because FARNA and TreeFolder differ most on it among all 11 tested RNA molecules. As the RMSD histograms below show, TreeFolder generates a much larger percentage of decoys with RMSD <4 Å than FARNA. We also compute the local RMSD at each position in the decoys, defined as the RMSD of the segment of four consecutive nucleotides starting at this position, as compared to the native structure. We then calculate the correlation between the local RMSD at each position and the global RMSD (see the correlation plot below). Among the decoys generated by both FARNA and TreeFolder, the local RMSD at position 13 has the highest correlation with the global RMSD. We also calculate the angle error at each position as $\mathrm{Error} = \lVert v - v_0 \rVert_2$, where $v$ is the angle vector of a decoy at one position and $v_0$ is the native angle vector at the same position.
The RMSD histograms of the 3000 decoys generated by our method TreeFolder (A) and FARNA (B) for 1esy.
Correlation between the local RMSD at each position and the global RMSD. The X-axis is the start position of a segment.
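The per-position correlation analysis can be sketched as follows, assuming the local and global RMSD values have already been computed for every decoy. The input layout (one row of local RMSDs per decoy), the use of the Pearson coefficient, and the function names are assumptions made for illustration; the text does not specify which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)


def local_global_correlation(local_rmsd, global_rmsd):
    """Correlate the local RMSD at each start position with the global RMSD.

    local_rmsd[d][p]: local RMSD of decoy d for the segment starting at
    position p (hypothetical layout); global_rmsd[d]: global RMSD of decoy d.
    """
    n_pos = len(local_rmsd[0])
    return [pearson([row[p] for row in local_rmsd], global_rmsd)
            for p in range(n_pos)]


# Made-up values for three decoys and two segment start positions.
local = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
overall = [2.0, 4.0, 6.0]
corr = local_global_correlation(local, overall)
print(corr)
```

The start position with the largest coefficient is the one whose local accuracy tracks the global accuracy most closely.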
The histograms below show the angle errors at positions 13, 14 and 15. The angles at these three positions determine the conformation of the segment starting at position 13. At positions 13 and 15, the angle errors of TreeFolder are significantly smaller than those of FARNA. As the histograms show, the angle errors of FARNA are distributed around several separated peaks, which may be caused by the limited number of fragments used in FARNA. In contrast, the angle errors of TreeFolder are distributed more smoothly, possibly because TreeFolder samples real-valued angles.
The angle error histograms at positions 13, 14 and 15. At positions 13 and 15, the decoys by our TreeFolder have much smaller angle errors than those by FARNA.
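For concreteness, the angle error at one position, defined earlier as the norm of the difference between the decoy angle vector and the native angle vector, can be computed as in the sketch below. The plain Euclidean norm follows that definition; the optional `period` argument, which wraps each angular difference before the norm is taken, is an assumption added for illustration and is not stated in the text.

```python
import math


def angle_error(v, v0, period=None):
    """Euclidean norm of the difference between two angle vectors.

    v, v0: angle vectors of a decoy and of the native structure at one
    position. If period is given (e.g. 360.0 for degrees), each difference
    is first wrapped into [-period/2, period/2); this wrapping is an
    assumption, not part of the stated definition.
    """
    if len(v) != len(v0):
        raise ValueError("angle vectors must have the same length")
    total = 0.0
    for a, b in zip(v, v0):
        d = a - b
        if period is not None:
            d = (d + period / 2) % period - period / 2
        total += d * d
    return math.sqrt(total)


print(angle_error([3.0, 4.0], [0.0, 0.0]))         # plain norm
print(angle_error([350.0], [10.0], period=360.0))  # wrapped difference
```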
Folding RNA using predicted secondary structures: we use the secondary structures predicted by CONTRAfold (Do et al., 2006) and sample 1000 decoys for each RNA. The quantiles of their RMSD values are shown in the comparison below. On 6 of the 10 tested RNAs, decoys generated from native secondary structures are better than those generated from predicted secondary structures. On the other four RNAs, the difference between the two types of decoys is small because the secondary structure prediction is accurate. The results for 1l2x and 2a43 from predicted secondary structures are poor, since all of their base pairs are contained in an H-type pseudoknot and only half of these base pairs are recovered by CONTRAfold. However, TreeFolder generates decent conformations for the half of the pseudoknot with predicted base pairs, as shown in brackets. In particular, TreeFolder generates decent structures for 2a43 from nucleotides 1 to 14 and for 1l2x from nucleotides 1 to 18. To improve sampling performance on the whole structures of 1l2x and 2a43, we need an energy function like the one used in FARNA to guide the folding simulation.
Comparison between folding with native and predicted secondary structure
Comparison with MC-Sym on the large RNA molecules: our TreeFolder is much faster than the MC-Fold and MC-Sym pipeline (Parisien and Major, 2008) for folding large RNA molecules, as shown in the running time table below. The running times in this table were obtained on a workstation with 96 GB RAM and 24 computing cores [2.67 GHz Intel(R) Xeon(R)].
Running time comparison between MC-Sym and our TreeFolder on large RNA molecules
Overlay examples: the figure below shows three overlay examples of 1q9a, 2a43 and 1xjr, with lengths of 27 nt, 26 nt and 49 nt, respectively. Native structures are shown in blue, and the best centroids produced by our algorithm are shown in red. As shown in this figure, our algorithm recovers a pseudoknot for 2a43.
Overlay representation of the best centroids (red) of 1q9a, 2a43 and 1xjr (from left to right) with their native structures (blue). These three RNA molecules have lengths of 27 nt, 26 nt and 49 nt.