Post-transcriptional regulation of genes and transcripts is a vital aspect of cellular processes and, unlike transcriptional regulation, remains a largely unexplored domain. One of the most obvious and important tasks in this domain is the discovery of functional RNA elements. Many RNA elements have been characterized to date, ranging from cis-regulatory motifs within mRNAs to large families of non-coding RNA such as pre-miRNAs, snRNAs, snoRNAs, gRNAs, tRNAs, rRNAs, and assorted ribozymes. Like protein-coding genes, the functional motifs of these RNA elements are highly conserved, but unlike protein-coding genes, it is most often structure and not sequence that is conserved. Proper characterization of these structural RNA motifs is both the key and the limiting step to understanding the post-transcriptional aspects of the genomic world.
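To illustrate what conservation of structure rather than sequence can look like, the following minimal sketch checks whether two alignment columns remain able to base-pair while the nucleotides themselves change. The toy alignment, the function name `column_pair_stats`, and the choice to count G-U wobble pairs as pairable are assumptions made for illustration; this is not the method of any tool discussed here.

```python
# Watson-Crick pairs plus G-U wobble pairs (an illustrative assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def column_pair_stats(alignment, i, j):
    """Return (fraction of sequences that can pair at columns i and j,
    number of distinct pair types observed among the pairable ones)."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    paired = [p for p in observed if p in PAIRS]
    return len(paired) / len(observed), len(set(paired))


# Hypothetical toy alignment: a 4-bp stem (columns 0-3 pairing with 11-8)
# closed by a 4-nt loop (columns 4-7).
alignment = [
    "GGCAGAAAUGCC",
    "GCCAGAAAUGGC",
    "GACAGAAAUGUC",
]

if __name__ == "__main__":
    for i, j in [(0, 11), (1, 10), (2, 9), (3, 8)]:
        frac, kinds = column_pair_stats(alignment, i, j)
        print(f"columns {i}-{j}: pairable in {frac:.0%} of sequences, "
              f"{kinds} distinct pair type(s)")
```

In this toy example, columns 1 and 10 differ in sequence across all three rows yet can pair in every row. Such compensatory changes are the kind of signal that covariation-based approaches look for.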
Here we focus on the task of structural motif discovery and on the informatics resources and tools geared towards it. We first present the existing databases of RNA structures and their known instances. These range from databases of directly imaged 3D structures to ones where consensus structures have been compiled either manually from the literature or computationally. They also include databases that catalog the results of genome-wide searches for conserved structures. Complementing these structure databases is a collection of tools for finding instances of known structures in new sequences.
Search Tools for Known Structural Motifs
We then move on to tools for discovering new structural motifs in a set of related sequences. These fall into two main families: those that rely on pre-aligning the sequences, and those that can work with unaligned sequences. The first group includes notable covariance-model-based approaches as well as a smattering of classifier-driven, Bayesian, thermodynamic, and aggregate approaches. The second contains many improvements on the Sankoff algorithm for simultaneous sequence/structure alignment, along with novel approaches such as shape abstraction, suffix arrays, genetic programming, and formal grammars. To aid in the comparison and benchmarking of these motif prediction algorithms, we also provide the two known attempts at compiling standardized datasets of motif-containing sequences. The newer of these, TUTR, also contains matched control sets to help properly estimate the sensitivity and specificity of each algorithm.
Consensus Structures in Aligned Sequences
Consensus Structures in Unaligned Sequences
Benchmark Data for Consensus Structure Prediction
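As a rough illustration of how a matched control set supports benchmarking, the sketch below computes sensitivity and specificity from per-sequence predictions on a motif-containing set and on a control set. It assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function name and the boolean-list input format are hypothetical and do not reflect the layout of any particular benchmark dataset.

```python
def sensitivity_specificity(hits_in_motif_set, hits_in_control_set):
    """Estimate sensitivity and specificity for one motif-finding tool.

    hits_in_motif_set:   one bool per motif-containing sequence,
                         True if the tool reported the motif.
    hits_in_control_set: one bool per control sequence (no motif expected),
                         True if the tool reported a motif anyway.
    """
    tp = sum(hits_in_motif_set)             # motifs correctly found
    fn = len(hits_in_motif_set) - tp        # motifs missed
    fp = sum(hits_in_control_set)           # spurious hits in controls
    tn = len(hits_in_control_set) - fp      # controls correctly left empty
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    motif_hits = [True, True, True, False]
    control_hits = [False, False, False, True, False]
    sens, spec = sensitivity_specificity(motif_hits, control_hits)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Without a control set there are no true negatives to count, so specificity in this sense cannot be estimated from motif-containing sequences alone.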
We have not included here the numerous tools for predicting structures in individual sequences, for predicting interactions between RNA structures, or those for folding sequences in a specific association context or with specific thermodynamic constraints. These go well beyond the task of motif prediction as it relates to families of functionally related mRNAs and ncRNAs.
Using the listed tools, it should be possible to survey the known space of functional RNA motifs, to search for known motifs in new sequences, and to discover new structure families in related sets of aligned and unaligned sequences. This should provide a good starting point for studies of post-transcriptional regulatory elements and non-coding RNAs.