RNA once is considered as the fundamental information medium in central dogma of molecular biology. A number of studies have indicated that RNAs play a more active role and carry diverse functionalities in nature, including mediating the synthesis of proteins, regulating cellular activities, and exhibiting enzyme-like catalysis and post-transcriptional activities. Furthermore, many recent discoveries have shown that the number and biological significance of functional RNAs has been underestimated. In living cells, RNAs do not remain in a linear form, which folds its secondary structure through base pairs including canonical bonds of A-U and G-C and wobble pair of G-U. For understanding RNA's functionality, the alignment and similarity of RNA should consider not only the primary structure (sequence) but also the secondary structure (base pairs).
Numerous approaches were proposed to measure the similarity between RNA secondary structures, which can be broadly categorized into two classes: alignment based string or tree representation of RNA secondary structure, and comparison based some numerical representation without alignment.
Most studies usually adopt dynamic programming algorithms and tree models. Some are usually based on the alignment of a string representation of the secondary structures such as the dot-bracket representation, in which a score function or a distance function to represent insertion, deletion and substitution of letters in the compared structures [1
]. Sequences considered in alignment of RNA secondary structures are not only string sequences but also secondary structure. Different weights or different score functions are designed for unpaired nucleotides and paired nucleotides.
Others are almost based on alignment of a tree representation of the RNA secondary structure elements or the base pairing probability matrices [5
]. Shapiro [5
] proposed various tree models used for representing RNA secondary structures without pseudoknots.
Each tree model offers a more or less detailed views of an RNA structure. Given the tree representations of two RNA secondary structure, one comparison way is based on the computation of the edit distance between the trees while the other focus on the alignment of the trees using the score of the alignment as a measure of the distance between the trees. Popular tools for optimal alignment of RNA secondary structures include RNAdistance [6
] and RNAforester [8
] etc. RNAdistance compares RNA secondary structures based on tree edit distance measure, while RNAforester computes the pairwise or multiple alignment of structures based on tree alignment measure. Hofacker [9
] measured RNA secondary structures in terms of the base pairing probability matrices computed by McCaskill's partition function algorithm [10
]. The popular tool based matrix of base pairing probabilities is RNApdist, which was implemented as part of the Vienna RNA package.
Because the above methods rely on dynamic programming algorithms, they are computation-intensive even if the pseudoknots are ignored. For example, the Sankoff's algorithm [11
] simultaneously allows the structure prediction and alignment problem with O
) in memory and O
) in time for two RNA sequences of lengthn. So these algorithms are still impractical for long RNA sequences. Recently some comparison algorithms without aligning them are proposed. Kin [12
] gave a kernel method based on Stochastic Context Free Grammar (SCFG).
The graphical representations of biosequences (protein, DNA and RNA) could be out of the mainstream but a new research view and tool to understand and analyze such biosequences. M.Randic [13
] reviewed the sufficient materials on related topics of graphical representations of protein, DNA and the secondary structure of RNA. Inspired by several graphical representations of DNA sequences [14
], some researchers have proposed 2D, 3D or 4D graphical approaches for the representations of RNA secondary structure and then derive some numerical invariants and different graphical measures from graphs to compare RNA secondary structures [19
], eight symbols of the unpaired bases A, C, G, U and paired bases A′, C′, G′, U′ were used to code RNA secondary structures as graphical representations. In [31
], the representations of eight letters have been demonstrated to be approximate and have some loss of information. In [32
], 12 symbols have been used to represent RNA secondary structure without loss of information, in which the key is to discriminate between the first and the second base of a hydrogen bond for the paired bases.
In this paper, motivated by DV curve representation of DNA sequences [33
], we propose a novel triple vector curve representation of RNA secondary structure. With this novel representation, a new RNA secondary structure similarity measure based on wavelet analysis is designed, which can simultaneously focus on the local structure and global structure. To evaluate our algorithms, we take the classification of non-coding RNA and RNA mutation as examples to compare to the two popular tools of RNAdistance and RNApdist.