Since MODENA is a ‘population-based’ algorithm based on GA, multiple solutions can be obtained at 1 run. In the present study, we use a GA population size
Np = 50 to evaluate the performance of MODENA. MODENA with this setting outputs at most 50 successfully designed RNA sequences at one run; we refer to the designed RNA sequence which has a similarity score of 1.0 as a ‘successfully designed RNA sequence’, and we refer to a run which outputs the successfully designed RNA sequence as a ‘successful run’.
12 A GA population size = 100 was also used in order to compare the results obtained with population sizes of 50 and 100. Throughout the present paper, we use a maximum GA iteration number equal to the population size (eg, we use a maximum iteration number = 50 when using a population size = 50). Performance comparison between MODENA and previous methods is not straightforward, since the previous methods are ‘nonpopulation-based’ algorithms, which output one solution at 1 run. To compare the performance of MODENA with those of nonpopulation-based algorithms (INFO-RNA 2.1 and RNAinverse of Vienna RNA Package 1.8.3), we performed 50 independent runs for each target structure when evaluating the nonpopulation-based methods. Independent multiple runs of a nonpopulation-based method can give multiple solutions different from each other, since the nonpopulation-based methods used in our performance comparison are stochastic algorithms.
Comparison of computational times of MODENA and the previous nonpopulation-based algorithms is also not straightforward. When we performed 50 independent runs of each nonpopulation-based algorithm for a target structure, we calculated ‘the expected time
Et required for finding a solution’, which was used in the previous studies.
11,
12 The
Et is defined as follows:
where Es is the average time of successful runs, Eu is the average time of unsuccessful runs, and fs is a rate of successful runs ( fs = the number of successful runs divided by the total number of independent runs). The Et gives an expected computational time for obtaining a successfully designed RNA sequence. When there is no successful run (ie, fs = 0), the Et and the computational time of MODENA are denoted as ‘−’ in our results. We discuss the computational speed of MODENA and other methods by comparing the computational time of 1 MODENA run and the Et of other methods.
An approximate set of weak Pareto optimal solutions
shows an example of how the solutions computed by MODENA evolve to the successfully designed RNA sequences. In the figure, we plot the solution distributions of the initial, third and final populations obtained in the GA iterations. In the initial distribution of this example, there are no successfully designed RNA sequences, ie, no solution has a similarity score of 1.0. These rather randomly distributed initial solutions successfully evolved to an approximate set of weak Pareto optimal solutions after 50 GA iterations, where 23 solutions with a wide range of stability scores were successfully designed. Moreover, we observed a fast convergence in this example; we obtained a solution distribution almost similar to that of the final population even at the tenth GA iteration (data not shown).
The results for the Rfam dataset
The performance evaluation results for the Rfam dataset are summarized in , where RNAfold was used as a direct problem solver of MODENA. In this table, the results for INFO-RNA and RNAinverse are also shown for the performance comparison. In the target structures taken from the annotations of the 29 RNA families in Rfam, MODENA successfully designed RNA sequences for 23 RNA families. This result is better than those obtained by INFO-RNA and RNAinverse, by which the RNA sequences of 17 and 13 RNA families were successfully designed, respectively. The free energy distributions for the successfully designed RNA sequences are shown in , and , where the results for the RNA families successfully designed by all methods (MODENA, INFO-RNA and RNAinverse) are represented. As can be seen from the figures, the RNA sequences designed by MODENA have much wider free energy ranges compared with those of INFO-RNA and RNAinverse. There are 2 possible reasons for this result. The first is the local search nature of INFO-RNA and RNAinverse. Since INFO-RNA and RNAinverse explore the region near to the initial RNA sequences, the initial guess can strongly affect the final RNA sequence. Interestingly, RNAinverse and INFO-RNA consistently outputs high free energy solutions and low free energy solutions, respectively, in our results. These results clearly correspond to the sequence initialization algorithms of the methods; RNAinverse uses a purely random compatible sequence as an initial sequence, while IFNO-RNA generates an initial sequence with a very low free energy by a dynamic programming. The second is the MOGA used in MODENA. MODENA is developed based on NSGA2; NAGA2 takes niching into account in the optimization procedure, which encourages the solutions in a population to be diverse in the objective function space. As a result, the approximate set of weak Pareto optimal solutions computed by MODENA tends to expand not only to a high stability score side but to a low stability score side. This leads to a wider free energy range of the RNA sequences designed by MODENA.
| Table 1The results for the Rfam dataset |
In almost all RNA families included in , and , the free energies of the natural RNA sequences corresponding to the target structures are located within the free energy range of MODENA, whereas the energy ranges of RNAinverse and INFO-RNA do not include the energies of the natural ones in almost all RNA families of , and .
To test the initial random number dependence of MODENA, we performed an additional 3 independent runs for the Rfam dataset with different initial random numbers. As a result, we found that the random number dependence of MODENA is very small, where only 1 run for RF00003 in an additional runs failed and the other RNA families, which were successfully designed in , were successfully designed in the additional 3 runs. It is noteworthy that we obtained 1 successful design of RF00028 in an additional run whereas design of this RNA family failed in .
As can be seen from , INFO-RNA is much faster than MODENA in many cases. Computational times of MODENA and RNAinverse are comparable.
To test the performance of the MODENA invoked with an option of a non-MFE direct problem solver (CentroidFold),
5 we performed the RNA inverse folding with the Rfam dataset, where a population size of 50 was used. As a result, we successfully designed RNA sequences for 20 RNA families out of the 29 RNA families of the Rfam dataset. Since this result is slightly worse than that obtained with the MFE direct problem solver (RNAfold), we performed larger calculations with the CentroidFold direct problem solver on the failed 9 target structures of the Rfam dataset, where we set both a GA population size and maximum iteration number to 100. These larger calculations improved the results, where two out of the nine target structures were successfully designed. These results imply that RNA inverse folding with a non-MFE direct problem solver is more difficult than that with a MFE direct problem solver.
Why we explore weak Pareto optimal solutions
In the Pareto optimal solutions for the RNA inverse folding problem, where the stability score and similarity score are used as objective functions, there exists only 1 solution (eg, solution A in ) with a similarity score of 1.0 which folds into the target structure; this is because the other solutions (eg, solution C in ) with a similarity score of 1.0 (and a lower stability score) are dominated by the single Pareto optimal solution with a similarity score of 1.0. This indicates that the stability score of the Pareto optimal solution with a similarity score = 1.0 is highest in the stability scores of all RNA sequences folding into the target structure. The other solutions (eg, solution E in ) in the Pareto optimal solutions have a similarity score σ < 1.0.
It is noted that even when we successfully find the single solution which has a similarity score of 1.0 and the best stability score in all RNA sequences folding into the target structure, our search does not finish. This is because the other solutions (eg, solution C in ) with a lower stability score and a similarity score of 1.0 usually exist and they are also acceptable as solutions of RNA inverse folding. Since not only very stable RNA structures but also those with a lower stability can be candidates for artificial functional RNA sequences, it is important to develop a methodology that can design the RNA sequences with a wide range of free energies. To obtain a set of the RNA sequences that fold into the target structure and have a wide range of stability score, we can utilize weak Pareto optimal solutions. In weak Pareto optimal solutions of the RNA inverse folding problem, multiple solutions are allowed to have a similarity score of 1.0 in contrast to the case of the Pareto optimal solutions, where only 1 solution is allowed to have a similarity score of 1.0 (in , solutions A and C are weak Pareto optimal solutions). Thus, weak Pareto optimal solutions can give more a comprehensive solution set for RNA inverse folding. Since it is difficult to obtain the complete set of weak Pareto optimal solutions, we explored the approximate set of weak Pareto optimal solutions in the RNA inverse problem by using a framework of MOGA.