RNA plays many important biological roles other than as a transient carrier of amino acid sequence information. It catalyzes peptide bond formation [1
], participates in protein localization [3
], serves in immunity [4
], catalyzes intron splicing and RNA degradation [5
], serves in dosage compensation [6
], is an essential subunit in telomerase [7
], guides RNA modification [8
], controls development [10
], and has an abundance of other regulatory functions [12
Non-coding RNAs (ncRNAs) are transcripts that have function without being translated to protein. The number of known ncRNAs is growing quickly [15
], and their significance had been severely underestimated in classic models of cellular processes [18
]. It is desirable to develop high-throughput methods for discovery of novel ncRNAs for greater biological understanding and for discovering candidate drug targets.
However, novel ncRNAs are difficult to detect in conventional biochemical screens [19
]: they are frequently short [18
], often not polyadenylated [19
], and might only be expressed under specific cellular conditions [20
]. Experimental screens have found many ncRNAs [23
], but have demonstrated that no single screen is capable of discovering all known ncRNAs for an organism. A more effective approach, demonstrated in previous studies [25
], may be to first detect ncRNA candidates computationally, then verify them biochemically. Considering the number of available whole genome sequences [31
], this approach can be applied to a large and diverse dataset, and has massive potential for novel ncRNA discovery.
The effectiveness of a computational ncRNA detection/classification method is determined by measuring its sensitivity and specificity on a test set of known ncRNAs and negative sequences. Sensitivity and specificity are defined as:
where true positives are ncRNAs that are detected by the method, true negatives are sequences that are not ncRNA and are not classified as ncRNA by the method, false positives are sequences that are not ncRNA, but are classified as ncRNA by the method, and false negatives are ncRNAs that are missed by the method.
Generally, there is a tradeoff between sensitivity and specificity – tailoring a computational method to increase one measurement may decrease the other. Throughout this paper, receiver operating characteristic (ROC) curves are used to visually express the quality of a ncRNA classification method by plotting sensitivity as a function of the false positive rate (1 – specificity), providing a complete description of all possible sensitivity/specificity tradeoffs. It should be noted that in a whole genome screen, high specificity is more essential than high sensitivity due to the large ratio of non-ncRNA sequence to ncRNA sequence. Low specificity results in an overwhelming number of false positives, swamping the number of true positives, and increasing the difficulty, time, and cost of a biochemical verification screen.
It has been proposed that ncRNAs may form secondary structures that are more stable than would be expected from non-ncRNA sequences of the same nucleotide or dinucleotide composition [38
]. This hypothesis has been controversial; it has been suggested that it is not true, or at least that the stability difference is not statistically significant enough to be a sensitive and specific criterion for classifying sequences as ncRNA [19
] (also claimed on the basis of a small set of tRNA in [42
]). However, the program RNAz was recently reported [43
] to use folding free energy changes of single sequences, combined with a structure conservation index (SCI) determined from a fixed, multiple sequence alignment, to effectively detect ncRNA. The SCI is the ratio of the consensus secondary structure free energy change (which also includes terms rewarding mutations evidencing structure conservation) determined by RNAalifold [44
] to the average folding free energy change for each sequence determined alone. This indicates that incorporating secondary structure conservation into a model based on folding free energy change improves the quality of prediction.
Here, the effectiveness of the program Dynalign [45
] as a tool for detection of ncRNA on the basis of predicted folding free energy change is investigated. Dynalign is a dynamic programming algorithm for simultaneously computing the lowest free energy common secondary structure and the structural alignment for two sequences. In brief, Dynalign minimizes ΔGtotal
ΔG°total = ΔG°1 + ΔG°2 + (number of gaps in alignment) × ΔG°gap penalty [eq. 3]
are the predicted folding free energy changes of secondary structures of sequence 1 and sequence 2, respectively, and ΔG°gap penalty
is a penalty applied for each gap in the alignment. Only conserved helices, i.e. those that appear in both sequences, are predicted. The conformational free energy changes are predicted using an empirical nearest neighbor model [47
] and ΔG°gap penalty
was empirically determined by maximizing structure prediction accuracy [45
]. Dynalign predicts secondary structure with significantly greater accuracy than single sequence structure prediction methods because of the additional information contained in the structural alignment [45
]. It requires no sequence identity between the two sequences to perform well because there are no energy terms (equation 3) that address sequence identity. Therefore, Dynalign is robust for cases in which extensive covariation of base-paired nucleotides exists as a result of sequence evolution.
Dynalign is initially implemented in this paper as a computational ncRNA classifier by using it to compute the ΔG°total of an input sequence pair, then comparing that value to the mean of ΔG°totals of control sequence pairs generated specifically for that input pair. If the ΔG°total of the input sequence pair is sufficiently lower than the mean ΔG°total of the set of controls, the input sequences are classified as ncRNA. The z score is used to quantify this difference, defined as:
z = (x - μ)/σ [eq. 4]
where x is the ΔG°total of the input sequence pair, and μ and σ are the mean and standard deviation of the ΔG°totals of sequence pairs in the control set, respectively. Therefore, the z score is just the number of standard deviations that the ΔG°total of the input sequence pair is above or below the mean of its set of controls.
It should be noted that the definition of z
score implies that the control set values follow a normal distribution, but it has been noted that the distribution of ΔG°s for single sequences is actually extreme value with skew towards lower folding free energies [19
]. Tests (data not shown) suggest that the distributions of ΔG°total
s of sequence pairs in control sets are also skewed towards lower free energies. However, the z
score is an effective measure for classification and has been used in this manner elsewhere [19
This approach is tested on a large database of known 5S rRNA and tRNA sequences and artificially generated negatives, demonstrating that the z
score based on the ΔG°total
can be used as a sensitive and specific classification measure. These results are also compared to RNAstructure [49
], a dynamic programming algorithm for single sequence secondary structure prediction by free energy minimization. Also, a support vector machine (SVM) is implemented to speed the classification process by training an SVM classifier that does not require a control set for an input sequence pair.
Additionally, the capability to use Dynalign as an effective genomic ncRNA screening tool is illustrated with a whole genome screen on a crude alignment of the Escherichia coli
and Salmonella typhi
], which contain a significant number of known ncRNAs. Many methods have been employed for genomic screens for ncRNAs of specific families [50
]; benchmarks and discussion in this paper are focused on the premise of using Dynalign as a general genomic screening tool for diverse, novel ncRNAs.
The above tests are benchmarked against two leading ncRNA prediction programs, QRNA [59
] (version 2.0.2c) and RNAz [43
] (version 0.1.1). RNAz uses a regression SVM to compute a z
score for each sequence in a multiple sequence alignment, then uses the mean of those z
scores and the SCI as input to a classification SVM. While structure predictions by Dynalign and RNAz are based on calculating the most stable secondary structure using experimentally determined thermodynamic parameters, QRNA uses a fully probabilistic covariance analysis approach that compares scores of three models – ncRNA, open reading frame, or other (null hypothesis) – for a pair of sequences.
Unlike Dynalign, which optimizes its own structural alignment, both QRNA and RNAz require a fixed sequence alignment as input. It is shown here that at low pairwise sequence identity, the Dynalign approach outperforms the fixed alignment approach. Additionally, Dynalign is shown to be a more sensitive ncRNA finder on whole genome screen tests.