Selecting highly effective siRNA sequences
Highly effective siRNA sequences are selected using an algorithm based on new guidelines developed by Ui-Tei et al. (Figure ). Users can also specify additional conditions: sequence conditions required for proper transcription initiation and termination (3) when designing short hairpin RNA, GC content, and custom rules.
The simple condition of requiring AA at the beginning of the target site is widely used to design siRNA sequences (2). This rule, however, frequently misses effective siRNA sequences that follow our four guidelines. Moreover, demanding AA at the beginning of the target site severely restricts the pool of siRNA sequence candidates from which sequences that minimize off-target silencing effects can be selected. We therefore did not incorporate this rule into our system.
Reduction of off-target silencing effects
Minimizing off-target silencing effects calls for selecting siRNA sequences that are guaranteed to have some mismatches to all unrelated sequences. Here we use a rigorous specificity measure called the mismatch tolerance: the minimum number of mismatches between the siRNA sequence and any non-targeted sequence. For instance, an siRNA sequence of mismatch tolerance three does not match any off-target candidate with fewer than three mismatches. A higher mismatch tolerance therefore indicates that the siRNA sequence remains specific even when hybridization tolerates some mismatches.
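The definition can be made concrete with a brute-force sketch. It assumes ungapped, equal-length comparisons and a small in-memory list of non-targeted sequences; the function names are illustrative and this is not the efficient method the system relies on.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_tolerance(sirna: str, non_targets: list[str]) -> int:
    """Minimum number of mismatches between `sirna` and any same-length
    window of any non-targeted sequence (ungapped comparison only)."""
    k = len(sirna)
    best = k  # upper bound: every position mismatches
    for seq in non_targets:
        for i in range(len(seq) - k + 1):
            best = min(best, hamming(sirna, seq[i:i + k]))
    return best


# Toy data: the second window-aligned copy differs at one position.
print(mismatch_tolerance("ACGTACGTAC", ["TTACGTACCTACTT"]))  # 1
```

The cost grows with the product of the database size and the oligo length for every candidate, which is why exhaustive evaluation of this kind is expensive at genome scale.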
Exact mismatch tolerance is costly to calculate, because it demands searching the entire sequence database to check whether individual siRNA oligos could cross-hybridize with unrelated sequences. The Smith–Waterman local alignment algorithm (16) can return accurate answers but is very time-consuming to execute. In contrast, BLAST (17) is much faster than the Smith–Waterman algorithm, but it may overlook significant alignments.
BLAST may overlook off-target candidates
The following alignment illustrates an example in which BLAST fails to identify the similarity between two 19 nt sequences that match with three mismatches, at the 5th, 10th and 14th positions. The longest run of contiguous matches between the two sequences is five bases, short of the seven contiguous base matches that make up the shortest word (run of consecutive nucleotides) BLAST requires to find hits.
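The seeding argument can be checked with a few lines of code. The sketch below uses a hypothetical 19 nt sequence (not one from the paper), introduces substitutions at positions 5, 10 and 14, and measures the longest run of contiguous matches against the word size of seven cited above.

```python
WORD_SIZE = 7  # shortest word length cited in the text for BLAST seeding


def longest_match_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    longest = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        longest = max(longest, run)
    return longest


def substitute(seq: str, positions: list[int]) -> str:
    """Return `seq` with a different base at each 1-based position."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    bases = list(seq)
    for p in positions:
        bases[p - 1] = swap[bases[p - 1]]
    return "".join(bases)


# Hypothetical 19 nt sequence, chosen only for illustration.
query = "GATTACAGATTACAGATTA"
subject = substitute(query, [5, 10, 14])

run = longest_match_run(query, subject)
print(run, run >= WORD_SIZE)  # 5 False
```

The match runs are 4, 4, 3 and 5 bases long, so no seed word of length seven exists and a word-based search never starts an extension for this pair.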
Moreover, BLAST with its default parameter values may fail to report the best alignment, the one with the minimum number of mismatches, when it receives input sequences as short as 19 bases. For instance, using BLAST, we searched the ‘nr’ database for ‘ACCGCAGTATATGGTTCTG’, a 19 nt siRNA sequence candidate for NM_000014, and received only the partial answer that the first 15 nt matched a substring of NM_002864 at 100% identity. In fact, the BLAST search overlooked the following best alignment with 18/19 matches.
This search failure is due to the default parameter values of BLAST. Since the ‘match reward’ and ‘mismatch penalty’ are set to 1 and −3, respectively, one mismatch must be followed by at least four additional matches before extending the running alignment improves its score. A partial remedy that lets such alignments extend fully is to reduce the mismatch penalty, but this is likely to output numerous low-homology alignments with off-target candidates.
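The arithmetic behind this behaviour can be sketched as a maximal-scoring-segment calculation. This is a simplification, not BLAST's actual extension procedure, and it assumes for illustration that the single mismatch of the 18/19 alignment falls at position 16, immediately after the 15 matching bases.

```python
def best_ungapped_segment(matches: list[bool], match: int = 1,
                          mismatch: int = -3) -> tuple[int, int, int]:
    """Return (score, start, end) of the highest-scoring contiguous segment
    of an ungapped alignment; `end` is exclusive. On ties the earliest,
    shortest segment is kept."""
    best = (0, 0, 0)
    score, start = 0, 0
    for i, is_match in enumerate(matches):
        if score <= 0:  # a non-positive prefix can never help: restart here
            score, start = 0, i
        score += match if is_match else mismatch
        if score > best[0]:
            best = (score, start, i + 1)
    return best


# Assumed layout: 15 matches, one mismatch, then 3 matches (18/19 overall).
hit_18_of_19 = [True] * 15 + [False] + [True] * 3

print(best_ungapped_segment(hit_18_of_19))               # (15, 0, 15)
print(best_ungapped_segment(hit_18_of_19 + [True]))      # (16, 0, 20)
print(best_ungapped_segment(hit_18_of_19, mismatch=-1))  # (17, 0, 19)
```

With the default scores, the three matches after the mismatch only recover the penalty, so the 15-base segment is not improved on; a fourth match, or a milder penalty, is needed before the full-length alignment scores higher.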
To date, most existing websites for designing siRNA sequences, such as siRNA Target Finder at the Ambion website, siDESIGN Center at Dharmacon, siRNA Target Finder at GenScript (18), and Gene specific siRNA selector (19), use BLAST to search for off-target candidates. As the above example illustrates, BLAST can miss optimal alignments, so relying on it may fail to minimize off-target silencing effects. In contrast, Qiagen utilizes SSearch, a rigorous Smith–Waterman search, which is computationally costly. It was the lack of efficient, accurate software for enumerating potential off-target candidates that motivated us to develop an efficient method for computing mismatch tolerance (20).
Non-redundant sequence set of genes
Another major issue was generating a non-redundant sequence set of genes for checking the target specificity of siRNA sequences. Traditional non-redundant sequence datasets, such as UniGene (21) and RefSeq (21), are not suitable for this purpose, because alternative splice variants in these datasets share common exons, which introduces duplication. Although searching for siRNA oligos on common exons is valuable for silencing all alternative splice variants simultaneously, one siRNA oligo may hybridize to any of the redundant copies of an exon, so duplicates must be eliminated to leave one representative exon before siRNA oligos can be properly designed. It is also necessary to consider siRNA sequences that target the junction connecting two exons of a particular alternative splice variant. Thus, non-redundant sequences spanning exon–exon junctions, together with duplicate-free exons, ought to make up the non-redundant sequence set of genes for checking target specificity (see Figure ).
Generation of the non-redundant sequence set of genes from alternative splice variants located on the same locus.
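One way to picture this construction is the sketch below. It is an assumption-laden illustration, not the procedure actually used: exons are given as 0-based half-open genomic intervals, only identical intervals count as duplicates, and each junction is represented by k − 1 bases from either side so that every k-mer crossing it is covered. The function name and data layout are hypothetical.

```python
Exon = tuple[int, int]  # 0-based, half-open genomic interval [start, end)


def non_redundant_set(genome: str, variants: list[list[Exon]],
                      k: int = 19) -> list[str]:
    """Collect duplicate-free exon sequences plus one sequence per distinct
    exon-exon junction for the splice variants of a single locus."""
    flank = k - 1  # bases needed on each side to cover all junction k-mers
    exons: set[Exon] = set()
    junctions: set[tuple[int, int, int, int]] = set()
    for exon_list in variants:
        exons.update(exon_list)  # shared exons collapse to one entry
        for (s1, e1), (s2, e2) in zip(exon_list, exon_list[1:]):
            # Clip the flanks so they never run past the exon boundaries.
            junctions.add((max(s1, e1 - flank), e1, s2, min(e2, s2 + flank)))
    exon_seqs = [genome[s:e] for s, e in sorted(exons)]
    junction_seqs = [genome[a:b] + genome[c:d]
                     for a, b, c, d in sorted(junctions)]
    return exon_seqs + junction_seqs


# Toy locus: three exons separated by "TT" introns, two splice variants.
genome = "AAAATTCCCCTTGGGG"
e1, e2, e3 = (0, 4), (6, 10), (12, 16)
print(non_redundant_set(genome, [[e1, e2, e3], [e1, e3]], k=3))
# ['AAAA', 'CCCC', 'GGGG', 'AACC', 'AAGG', 'CCGG']
```

Exons shared by both variants appear once, while the junction unique to the exon-skipping variant contributes its own short sequence.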
Since such a database was not available, we created one. First we aligned all the human RefSeq and Unique UniGene sequences onto the human genomic sequences (hg16). For each query sequence, we selected the best alignment that had >90% coverage ratio and >85% match ratio. We retrieved duplicate-free exons and sequences over exon–exon junctions. However, since some sequences were not totally aligned to the genomic sequence, due to sequence errors or the incompleteness of the genomic sequence, subsequences that failed to match were added to the non-redundant sequence set. The non-redundant sequence set of mouse genes was similarly generated.
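The alignment-selection step can be sketched as a simple filter. The thresholds come from the text; everything else is assumed for illustration, including the record layout, the definitions of coverage ratio (aligned query bases over query length) and match ratio (identical bases over aligned bases), and the choice of "best" as the passing alignment with the most identical bases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query_id: str
    query_length: int    # total length of the query transcript
    aligned_length: int  # query bases placed on the genome
    matches: int         # aligned bases identical to the genome

    @property
    def coverage(self) -> float:
        return self.aligned_length / self.query_length

    @property
    def match_ratio(self) -> float:
        return self.matches / self.aligned_length


def best_alignments(alignments: list[Alignment], min_coverage: float = 0.90,
                    min_match: float = 0.85) -> dict[str, Alignment]:
    """Keep, per query, the passing alignment with the most matched bases."""
    best: dict[str, Alignment] = {}
    for aln in alignments:
        # Both thresholds are strict, as in ">90%" and ">85%".
        if aln.coverage <= min_coverage or aln.match_ratio <= min_match:
            continue
        current = best.get(aln.query_id)
        if current is None or aln.matches > current.matches:
            best[aln.query_id] = aln
    return best
```

Queries with no alignment passing both thresholds are simply absent from the result, which mirrors the case where unmatched subsequences have to be handled separately.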
The major benefit of using the non-redundant sequence set is that the mismatch tolerance of a ‘redundant sequence’, defined as a substring of more than one exon on the same locus of the genome, is likely to be higher in the non-redundant sequence set than in the original set of human RefSeq and Unique UniGene sequences. We will present the statistics below.
Selection of target-specific siRNA sequences from the non-redundant sequence set
If an siRNA sequence is designed according to our four guidelines for effective sequences, the siRNA antisense strand is thought to be incorporated into the RISC more efficiently than the sense strand. This property may allow us to select effective siRNA sequences by considering only the sense target, that is, the complement of the siRNA antisense strand, within the non-redundant sequence set. For siRNA sequences, we therefore define the plus-strand mismatch tolerance as the mismatch tolerance calculated using only the sense target sequence. However, we cannot disregard the possibility that the siRNA sense strand is also incorporated into the RISC and causes off-target effects, so it is more reliable to take both strands of the siRNA into consideration. The both-strand mismatch tolerance is defined as the minimum number of mismatched bases with which the siRNA antisense or sense strand matches a non-targeted sequence in the non-redundant sequence set. There remains the question of how high the mismatch tolerance of an siRNA sequence must be for the sequence to be treated as target-specific.
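The two definitions differ only in which strands are searched, as the brute-force sketch below shows. It assumes a DNA alphabet, ungapped comparisons and a small in-memory list of non-targeted sequences; searching the reverse complement of the sense target stands in for transcripts that the sense strand could bind. All names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def min_mismatches(probe: str, sequences: list[str]) -> int:
    """Fewest mismatches between `probe` and any same-length window."""
    k = len(probe)
    best = k
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            best = min(best, sum(x != y for x, y in zip(probe, window)))
    return best


def plus_strand_tolerance(sense_target: str, non_targets: list[str]) -> int:
    """Off-targets of the antisense strand: sites resembling the sense target."""
    return min_mismatches(sense_target, non_targets)


def both_strand_tolerance(sense_target: str, non_targets: list[str]) -> int:
    """Also count off-targets of the sense strand, i.e. sites resembling
    the reverse complement of the sense target."""
    return min(min_mismatches(sense_target, non_targets),
               min_mismatches(reverse_complement(sense_target), non_targets))


# Toy data: the non-target contains the exact reverse complement.
sense = "AAACCCGGGT"
non_targets = ["GGACCCGGGTTTGG"]
print(plus_strand_tolerance(sense, non_targets))  # 2
print(both_strand_tolerance(sense, non_targets))  # 0
```

By construction the both-strand value can never exceed the plus-strand value, so requiring both-strand specificity is the stricter criterion.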
The non-redundant sequence set of human genes was analyzed to obtain a comprehensive picture of the mismatch tolerance distribution of the 19 nt sequences that occur in it. The statistics for 19 nt sequences in Figure A show that 9.5% have a both-strand mismatch tolerance of three or four, and that no sequence has a both-strand mismatch tolerance of five or more. The fraction doubles when the plus-strand mismatch tolerance is calculated instead, as illustrated in Figure B. From these results, we anticipate that effective siRNA sequences with a both-strand or plus-strand mismatch tolerance of three or four can be designed for most mRNA sequences, and we define a sequence to be both-strand (plus-strand, respectively) specific if its both-strand (plus-strand) mismatch tolerance is three or more. Indeed, Figure C shows that at least one effective both-strand specific siRNA sequence can be designed for 96.3% of mRNA sequences in RefSeq. The fraction increases to 97.7% if plus-strand specificity is considered instead.
Figure 3 (A) The vertical axis is the number of 19 nt sequences of the both-strand mismatch tolerance shown in the horizontal axis. The solid line is the distribution for the non-redundant sequence set. For comparison, the dotted line shows the distribution when ...
Figure D verifies the usefulness of the non-redundant sequence set: in the original sequence set of human RefSeq and Unique UniGene, most redundant 19 nt sequences have a both-strand mismatch tolerance of zero, whereas in the non-redundant sequence set, 10.7% of redundant 19 nt sequences are both-strand specific, that is, have a mismatch tolerance of three or more.
Figure illustrates the flowchart of siRNA sequence selection by our system. First, our web server accepts an arbitrary sequence, or an accession number from which it retrieves the sequence. The query is then processed to find effective, gene-specific siRNA sequences by searching the non-redundant sequence set for each 19 nt sequence. To speed up this computation, we precompute all the both/plus-strand specific siRNA sequences. With this precomputation, the server takes just a couple of seconds to return the complete list of both/plus-strand specific siRNA sequences for a typical mRNA sequence (Figure B).
Figure 4 Flowchart of siRNA sequence selection by siDirect. (A) An arbitrary mRNA sequence is input. (B) Both/plus-strand specific precomputed siRNA sequences are presented in front. Both-strand specific (plus-strand specific, respectively) sequences ...
Care has to be taken in selecting an siRNA sequence, since it may cross-react with off-target candidates. Our web server provides further information for examining the off-target silencing effects of a specific sequence. Clicking on an siRNA sequence asks the system to search the non-redundant sequence set for all the potential off-target candidates with which the siRNA sequence might cross-react. This complete search may sound computationally costly, but our algorithm (20) is capable of processing the request in less than a second. The server then displays the alignment between each off-target candidate and the siRNA sequence to show the locations of mismatches, which is useful in assessing the off-target silencing potential (Figure C).
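The kind of report described here can be mimicked with a naive scan. The sketch below is a brute-force stand-in, not the fast algorithm of (20); the sequence names, the mismatch cutoff and the text layout of the alignment are all assumptions made for illustration.

```python
def find_off_targets(sirna: str, sequences: dict[str, str],
                     max_mismatches: int = 2
                     ) -> list[tuple[str, int, str, list[int]]]:
    """List (name, offset, window, mismatch positions) for every window
    within `max_mismatches` of `sirna`; positions are 1-based."""
    k = len(sirna)
    hits = []
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            positions = [j + 1 for j, (x, y) in enumerate(zip(sirna, window))
                         if x != y]
            if len(positions) <= max_mismatches:
                hits.append((name, i, window, positions))
    return hits


def format_alignment(sirna: str, window: str) -> str:
    """Three-line view with '|' under matches and a blank under mismatches."""
    bars = "".join("|" if x == y else " " for x, y in zip(sirna, window))
    return f"{sirna}\n{bars}\n{window}"


# Toy database with hypothetical sequence names.
database = {"seqA": "TTACGTACCTACTT", "seqB": "GGGGGGGGGGGG"}
for name, offset, window, positions in find_off_targets("ACGTACGTAC", database):
    print(name, offset, positions)
    print(format_alignment("ACGTACGTAC", window))
```

Showing the mismatch positions alongside each candidate lets a user judge, for example, whether the mismatches cluster at one end of the oligo or are spread along it.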
We plan to update our web server in response to major revisions to the human genome, the mouse genome, the RefSeq database and the UniGene database, though such updates will inevitably remove some existing siRNA sequence candidates and add novel ones.