Computational RNA secondary structure prediction approaches differ in the way RNA pseudoknot interactions are handled. For reasons of computational efficiency, most approaches allow only a limited class of pseudoknot interactions or do not consider them at all. Here we present a computational method for RNA secondary structure prediction that is not restricted in terms of pseudoknot complexity. The approach is based on simulating a folding process in a coarse-grained manner by choosing helices based on established energy rules. The steric feasibility of the chosen set of helices is checked during the folding process using a highly coarse-grained 3D model of the RNA structures. Using two data sets of 26 and 241 RNA sequences we find that this approach is competitive with the existing RNA secondary structure prediction programs pknotsRG, HotKnots and UNAFold. The key advantages of the new method are that there is no algorithmic restriction in terms of pseudoknot complexity and that a test is made for steric feasibility. Availability: The program is available as a web server at http://cylofold.abcc.ncifcrf.gov.
The variety of biochemical functions that are being carried out by RNA molecules is mesmerizing. Many RNAs such as ribosomal RNA, RNase P or tRNA attain a defined secondary and tertiary structure that is vital to their function. Experimentally determined structures are only available for a small fraction of the RNAs that are of interest. This makes the computational prediction of the base-pairing pattern (the secondary structure) of RNA an important problem. One major breakthrough was the development of dynamic programming algorithms that could predict the minimum free energy secondary structure of an RNA sequence assuming that the base pairs are nested, i.e. that the structure is free of pseudoknots (1–5). Subsequently, dynamic programming algorithms have been extended to allow certain classes of pseudoknots (6,7).
Many RNA secondary structure prediction algorithms (including the one presented here) are based on the idea of iteratively adding substructures to an initially unfolded sequence (8,9). Genetic algorithms are an example of such algorithms and have proven very useful for exploring pseudoknotted structures and sub-optimal RNA structures (10–14).
Allowing pseudoknots is desirable simply because RNA structures determined by X-ray crystallography or NMR revealed that many RNAs contain non-nested base pairing interactions. Allowing all possible base pairing interactions leads to the potential problem for structure prediction approaches that not only are there many more conformations to consider, but also many conformations are not sterically feasible. Here, we describe a computational approach for RNA secondary structure prediction that has no restriction in terms of pseudoknot complexity, but additionally checks the steric feasibility of the considered conformations.
The described approach of RNA secondary structure prediction is based on the idea of maximizing matching helices in a secondary structure (10). A flow chart of the algorithm is shown in Figure 1. Briefly, the method works as follows: Initially, a list (called a stem-list) of all possible helices with more than 3 bp is generated. Helices can contain Watson–Crick and GU–wobble base pairs. The secondary structure prediction is performed by picking the best-scoring structure obtained after 50 folding simulation runs. The score is set to be the sum of the free energy contributions of the already placed helices. Each folding simulation run is performed by picking helices from the stem-list with a Boltzmann-weighted probability. Estimating the free energy contribution of an RNA double-helix is accomplished using the Vienna RNA package (2). Each chosen helix is represented by a very coarse-grained 3D representation in a virtual 3D workspace. An RNA double helix is represented by a cylinder (using a radius of 6.5 Å and a length of 2.7 Å times the number of base pairs) that is capped with a half-sphere on both ends. This shape is called a capsule. A schematic diagram of the mapping of an RNA secondary structure into a highly coarse-grained 3D representation is shown in Figure 2. The main reason for choosing capped cylinders over regular cylinders is the computational efficiency of collision detection. Single-stranded regions between helices are represented as constraints for the maximum distance between the ends of the capped cylinders. A newly chosen capped cylinder is placed into the 3D simulation space at a random position such that the distance constraints are fulfilled. The distance constraints are a function of the single-stranded sequence lengths between connected helices. The maximum distance between helix ends is 2.0 Å + n*8.0 Å with n being the sequence separation. The minimum distance is 2.0 Å.
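The capsule geometry and the distance constraints described above can be illustrated with a short Python sketch. It is not taken from the CyloFold source (which is written in C++). The function names are hypothetical, and two points are assumptions rather than statements from the text: that the capsule axis segment spans exactly the cylinder length, and that two capsules of equal radius are treated as colliding when their axis segments are closer than twice that radius (the usual capsule test, which is what makes capsules cheap to check).

```python
import math

HELIX_RADIUS = 6.5   # capsule radius in Angstrom (from the text)
RISE_PER_BP = 2.7    # cylinder length per base pair in Angstrom (from the text)

def capsule_length(num_bp):
    """Length of the cylindrical part of a helix capsule."""
    return RISE_PER_BP * num_bp

def distance_bounds(n):
    """(min, max) distance between helix ends joined by n single-stranded residues."""
    return 2.0, 2.0 + 8.0 * n

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _clamp01(x):
    return max(0.0, min(1.0, x))

def _point(p, d, s):
    return tuple(pi + s * di for pi, di in zip(p, d))

def segment_distance(p1, q1, p2, q2):
    """Minimum distance between segments p1-q1 and p2-q2 (standard clamped solution)."""
    d1, d2, r = _sub(q1, p1), _sub(q2, p2), _sub(p1, p2)
    a, e, f = _dot(d1, d1), _dot(d2, d2), _dot(d2, r)
    eps = 1e-12
    if a <= eps and e <= eps:          # both segments degenerate to points
        s = t = 0.0
    elif a <= eps:                     # first segment is a point
        s, t = 0.0, _clamp01(f / e)
    else:
        c = _dot(d1, r)
        if e <= eps:                   # second segment is a point
            s, t = _clamp01(-c / a), 0.0
        else:
            b = _dot(d1, d2)
            denom = a * e - b * b      # zero when the segments are parallel
            s = _clamp01((b * f - c * e) / denom) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                s, t = _clamp01(-c / a), 0.0
            elif t > 1.0:
                s, t = _clamp01((b - c) / a), 1.0
    return math.dist(_point(p1, d1, s), _point(p2, d2, t))

def capsules_collide(axis_a, axis_b, radius=HELIX_RADIUS):
    """Assumed rule: capsules overlap if their axes are closer than 2 * radius."""
    return segment_distance(*axis_a, *axis_b) < 2.0 * radius
```

Because the half-sphere caps are covered by the same segment distance, no special handling of the helix ends is needed, which a flat-ended cylinder would require.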
If cylinders collide, the newly placed capped cylinder is placed at a different random position. If after 20 attempts the newly placed capped cylinder is still colliding with previously placed capped cylinders, the positions of all capped cylinders are optimized in order to minimize collisions and constraint violations. If no collision-free position can be found, the newly chosen helix and its capped cylinder representation are discarded. Otherwise, the collision-free position that was found is stored. Helices that are part of the stem-list and that share bases with the newly placed helix are removed from the stem-list. In the next iteration the next helix is chosen, and this is repeated until no more helices can be placed. Once no more helices can be placed, one simulation run is completed. Fifty simulation runs are performed and the overall best-scoring structure is returned to the user.
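The control flow of the folding simulation can be summarized in a minimal Python sketch. It is an illustration under stated assumptions, not the CyloFold implementation: a helix is a hypothetical tuple (i, j, length, energy) with 0-based outermost pair (i, j), the value of RT (0.616 kcal/mol, roughly 37 °C) is assumed because the text gives no temperature, a rejected helix is assumed to be dropped for the rest of the run, and the whole steric placement procedure is hidden behind a caller-supplied function `is_placeable`.

```python
import math
import random

RT = 0.616  # kcal/mol, assumed value near 37 C (not given in the text)

def helix_bases(helix):
    """Set of sequence positions occupied by both strands of a helix."""
    i, j, length, _energy = helix
    return set(range(i, i + length)) | set(range(j - length + 1, j + 1))

def pick_boltzmann(stems, rng):
    """Pick one helix with probability proportional to exp(-E / RT)."""
    e_min = min(h[3] for h in stems)  # shift energies for numerical stability
    weights = [math.exp(-(h[3] - e_min) / RT) for h in stems]
    return rng.choices(stems, weights=weights, k=1)[0]

def folding_run(stem_list, is_placeable, rng):
    """One simulation run: place helices until the stem-list is exhausted."""
    candidates = list(stem_list)
    placed = []
    while candidates:
        helix = pick_boltzmann(candidates, rng)
        candidates.remove(helix)
        if not is_placeable(helix, placed):
            continue  # no collision-free position: discard this helix
        placed.append(helix)
        used = helix_bases(helix)
        # Remove helices that share bases with the newly placed helix.
        candidates = [c for c in candidates if not (helix_bases(c) & used)]
    score = sum(h[3] for h in placed)  # sum of helix free energies
    return placed, score

def predict(stem_list, is_placeable, runs=50, seed=0):
    """Return the best-scoring (lowest energy) structure over all runs."""
    rng = random.Random(seed)
    results = (folding_run(stem_list, is_placeable, rng) for _ in range(runs))
    return min(results, key=lambda result: result[1])
```

In a fuller version, `is_placeable` would carry out the random capsule placement, the 20 retries and the position optimization described above; here any function taking the candidate helix and the list of placed helices can be plugged in.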
The folding algorithm is implemented as a C++ program. The web server has been implemented using the Grails framework (18), which is based on the Groovy programming language. For a secondary structure prediction request, the web server launches the cylofold binary on a Linux compute cluster. After the prediction result has been generated, the program VARNA (19) is launched to generate an image of the secondary structure prediction. The prediction results are temporarily stored in a relational database.
A user of the CyloFold prediction web server can start a secondary structure prediction request by entering (‘pasting’) a nucleotide sequence into the web form and pressing ‘submit’. The sequence can be given as raw characters or in FASTA format; both the ACGU and ACGT alphabets are accepted. The maximum sequence length that is currently accepted by the web server is 300 nt. The initial return of the web server is a unique id, which is needed if one wants to access results at a later time. Due to the compute-intensive approach for the prediction, it can take several minutes for the server to finalize a secondary structure prediction. The user can access the results by one of three methods. A simple ‘reload’ of the initial result page will update the status of the prediction and will eventually show the prediction results. Alternatively, the user can bookmark the initial result page in the web browser and return to it at a later time. Lastly, the unique id provided after submitting the secondary structure prediction request can be used to access the results using another web form available on the server home page.
A typical output from a completed RNA structure prediction is shown in Figure 3. The prediction result is presented to the user in three different formats: (i) An image of the predicted RNA secondary structure created by VARNA (19); (ii) An extended bracket notation in which nested base pairs are denoted as pairs of nested parentheses and helices corresponding to pseudoknot interactions are denoted as letters; (iii) The ‘CT’ file format that is also generated by other programs such as mfold (5). This format contains a list of the indices of the bases and their predicted base-pairing partners.
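As an illustration of the ‘CT’ output, the following Python sketch writes and reads a base-pair list in a CT-like layout. The column order used here (index, base, previous index, next index, pairing partner with 0 for unpaired, index) and the header line (sequence length followed by a title) reflect the layout commonly produced by mfold, stated from general knowledge rather than from this article; the exact output of the CyloFold server may differ. The function names are hypothetical.

```python
def to_ct(sequence, pairs, title="prediction"):
    """Format a sequence and 1-based base pairs (i, j) as CT-style text."""
    n = len(sequence)
    partner = [0] * (n + 1)          # partner[k] == 0 means base k is unpaired
    for i, j in pairs:
        partner[i] = j
        partner[j] = i
    lines = [f"{n}\t{title}"]
    for idx, base in enumerate(sequence, start=1):
        nxt = idx + 1 if idx < n else 0   # assumed: last base has no successor
        lines.append(f"{idx}\t{base}\t{idx - 1}\t{nxt}\t{partner[idx]}\t{idx}")
    return "\n".join(lines)

def from_ct(text):
    """Parse CT-style text back into (sequence, sorted list of base pairs)."""
    rows = [line.split() for line in text.splitlines()[1:] if line.strip()]
    sequence = "".join(row[1] for row in rows)
    pairs = sorted(
        (int(row[0]), int(row[4])) for row in rows if int(row[4]) > int(row[0])
    )
    return sequence, pairs
```

Because each base simply names its partner, this format can represent pseudoknotted (crossing) pairs without any change, unlike plain parenthesis notation.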
The performance of the new RNA secondary structure prediction method was evaluated using two different data sets. Data set 1 (corresponding to the results shown in Table 1) consists of 26 RNA sequences, whose tertiary structure is available in the Protein Data Bank (PDB). The reference secondary structure was obtained by extracting the base pair information from the PDB coordinate file using the program RNAview (20). Data set 2 consists of 241 RNA sequences and secondary structures originating from PseudoBase (21,22).
In order to quantify the time-complexity of the folding method, we fitted a function of the form a*N^b (with N being the number of residues in the input sequence) to the execution times measured for the 241-sequence set. We found that the execution time (measured in seconds) of the structure prediction is well described by the function 2.74*10^−8*N^4.47. The timing evaluation was performed on a computer with 4 GB of RAM and an Intel 64-bit Xeon processor (3.0 GHz).
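The fitted timing function can be evaluated directly. The short Python sketch below uses the two fitted parameters quoted above; the function name is hypothetical, and the result is only an estimate for the benchmark machine described in the text, since the fit was made on the 241-sequence set and need not extrapolate well to longer sequences.

```python
FIT_A = 2.74e-8   # prefactor of the fitted power law (seconds)
FIT_B = 4.47      # fitted exponent

def estimated_runtime_seconds(n_residues):
    """Estimated prediction time from the fitted power law a * N**b."""
    return FIT_A * n_residues ** FIT_B

if __name__ == "__main__":
    for n in (50, 100, 200, 300):
        print(f"N = {n:3d} nt: about {estimated_runtime_seconds(n):.1f} s")
```

With an exponent of about 4.5, doubling the sequence length increases the estimated run time by a factor of roughly 2^4.47, i.e. about 22.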
We report in Tables 1 and 2 prediction results for these two data sets together with the corresponding results obtained by running the RNA secondary structure prediction programs HotKnots 2.0 (8), pknotsRG (7) and UNAFold (23).
For data set 1, the average Matthews correlation coefficient (MCC), obtained by comparing the base pairing pattern of each predicted secondary structure with its reference secondary structure, is 0.83 for CyloFold, compared to 0.82 for pknotsRG, 0.75 for HotKnots 2.0 and 0.73 for UNAFold (see the row of Table 1 named ‘All’).
We divided this data set into two subsets according to the fraction of pseudoknotted base pairs in the respective structures. The results can be seen in the last two rows of Table 1. The eight PDB structures with <5% pseudoknotted base pairs correspond to an average MCC of 0.87 for CyloFold compared to 0.81 for pknotsRG, 0.82 for HotKnots 2.0 and 0.87 for UNAFold. The 18 structures listed in Table 1 that have >5% pseudoknotted base pairs correspond to an average MCC of 0.81 for CyloFold, 0.82 for pknotsRG, 0.73 for HotKnots 2.0 and 0.66 for UNAFold.
Using the larger data set 2, one obtains an average MCC of 0.752 for CyloFold and 0.748 for pknotsRG (Table 2). In Table 2 one can see the RNA secondary structure predictions obtained by CyloFold correspond to the highest MCC (compared to the programs pknotsRG, HotKnots 2.0 and UNAFold). It also has the highest average base pair prediction sensitivity (0.763). For another measure, the positive predictive value (how often are predicted base pairs part of the reference secondary structure), all programs obtain averages between 0.68 and 0.76 for data set 2 with pknotsRG leading with a value of 0.756. It should be noted that the MCC is often used as an overall measure of prediction quality, while sensitivity, specificity and positive predictive value capture certain other aspects of the prediction quality.
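The three quality measures discussed above can be computed from a predicted and a reference set of base pairs, as in the following Python sketch. The article does not spell out how true negatives are counted; the sketch assumes that every one of the N(N−1)/2 possible position pairs that is neither predicted nor present in the reference counts as a true negative. The function name is hypothetical.

```python
import math

def pair_metrics(predicted, reference, n_residues):
    """Return (sensitivity, positive predictive value, MCC) for base-pair sets."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    tp = len(pred & ref)              # predicted pairs found in the reference
    fp = len(pred - ref)              # predicted pairs absent from the reference
    fn = len(ref - pred)              # reference pairs that were missed
    # Assumption: all remaining possible pairs are true negatives.
    tn = n_residues * (n_residues - 1) // 2 - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, ppv, mcc
```

Because the number of true negatives is usually much larger than the other counts, the MCC computed this way lies close to the geometric mean of sensitivity and positive predictive value.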
These results indicate that the prediction accuracy of CyloFold is similar to that of pknotsRG. The key advantage of CyloFold is that there is no restriction in terms of the classes of pseudoknots that are being considered. Also, it should be noted that the employed model of simulated RNA folding, in which helices are placed with a probability according to their free energy contribution, is in essence very simple (24). In that sense it is surprising how well the method performs, and it should be an encouragement to continue to develop RNA folding algorithms that are substantially different from established approaches.
CyloFold is a new method for RNA secondary structure prediction. We show using two different data sets that the prediction accuracy (MCC) is comparable to that of the RNA secondary structure prediction program pknotsRG. The search algorithm has no restriction in terms of pseudoknot complexity. Another novel aspect is that at each step during the simulated folding process, the steric feasibility of the predicted structures is checked using a highly coarse-grained 3D representation. The method is made available in the form of a user-friendly web server.
This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract HHSN26120080001E. This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. Funding for open access charge: National Cancer Institute.
Conflict of interest statement. None declared.
We wish to thank the Advanced Biomedical Computing Center (ABCC) at the NCI for their computing support. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.