CRT (version 1.0), Pilercr (version 1.0) [
17] and Patscan [
18] were compared based on execution speed and ability to correctly identify CRISPRs. Patscan is a generic pattern discovery application that identifies repetitive sequences given a user-specified input pattern. The number of repeats that Patscan detects must be predefined, and the tool has no mechanism for distinguishing repeating and non-repeating regions of CRISPRs. Thus, considerable manual processing of the output is required in order to remove unwanted results and to extend repetitive sequences beyond the fixed size limit set by the input pattern. Pilercr is a recently developed tool designed specifically for the automatic detection of CRISPR elements. It is based on the Piler [
20] program, which utilizes alignment matrices for detecting contiguous repeats.
Both Patscan and Pilercr were implemented in the C programming language. CRT was developed using Java. All tools were tested on finished microbial genomes available in the IMG version 1.5 database [
21]. Each was run under Cygwin version 1.5.21 on a PC having the following specifications: Windows XP operation system, Pentium 3.4 GHz processor, 1.0 GB RAM.
Speed Evaluation
Figures , and show the execution times of the three tools. The x-axis contains the accession number for the analyzed organisms followed by their approximate number of base pairs in millions (Mbp). As the number of repeats may affect execution time, only genomes with similar repeats counts were used (between 70 and 80). Figures and are based on a search with repeat size 21 – 37, spacer size 19 – 48, and minimum number of CRISPR repeats 3. CRT required an additional setting for search window length. It was tested for values 6 and 8. Figure shows results when searching for longer repeats of size 19 – 50 and spacer size 19 – 60 (this is beyond the range of any CRISPRs found in any previous work). Patscan is not included in this figure, or any subsequent figures, because it's extended running times flattens the other graph lines, making it difficult to compare the other tools (see Figure ).
The speed of CRT and Pilercr is very impressive and a huge improvement over the previous technique of CRISPR detection using Patscan, as shown in Figures and . CRT achieves the best performance, being able to process a DNA sequence of nearly 6 million bases in about 3 seconds using a search window of size 8, CRT(8), and in about 2 seconds for a search window of size 6, CRT(6).
Figure shows a slight decrease in the performance of CRT as the range in the size of repeats to be detected is increased (see the previous section). The performance of Pilercr, however, appears to be independent of the size range of repeats. For these settings, the speed of CRT(8) and Pilercr are about the same, with CRT(6) performing best.
In the previous example, execution speed was analyzed based on increasing genome size. In Figure , speed is analyzed for increasing number of repeats. Only genomes of similar sizes were used (2.7 – 3.8 Mbp). The repeat size is 21 – 37 and spacer size is 19 – 48. The figures show that CRT performs better than Pilercr for larger number of repeats. Like Pilercr, whose speed appears to be independent of the size range of repeats, CRT's speed is independent of the number of repeats contained in a genome. Actually, CRT improves slightly in processing time as the number of repeats increases. This is because it is able to process sections of a sequence containing repeats very fast, as explained in the previous section.
Retrieval Evaluation
In order to assist in determining the effectiveness of the three tools in identifying CRISPR elements, three evaluation measures were used: quality, precision and recall.
Quality
Detected CRISPRs are sometimes inconsistent with their actual form in a sequence. This generally results because DNA repeats are not always exact, and consequently are often difficult to correctly identify. Three common types of inconsistencies were identified in this study. Type I inconsistencies occur when a tool reports a CRISPR that is incomplete (that is, the CRISPR does not contain all of the repetitive sequences). Type II inconsistencies occur when repeats within a CRISPR do not begin and/or end at the correct position. For example, A CRISPR that actually begins with the sequence GTTTAC may be reported as beginning with TTTAC. In this case, it can be seen that the reporting tool is off by one position. Type III inconsistencies occur when a CRISPR is split. For example, a single CRISPR containing 10 repeats may be reported as two CRISPRS, each containing 5 repeats.
Let quality represent the likelihood that a CRISPR reported by a search tool does not contain an inconsistency of Type I, Type II or Type III. Based on this definition, there is no distinction between a CRISPR with one inconsistency and a CRISPR with three inconsistencies. Given the set of CRISPRs resulting from a search, let a be the total number of CRISPRs reported and b be the total number of CRISPRs containing at least one inconsistency. Assuming a > 0, quality (q) for a search tool can be computed as follows.
Precision and Recall
Quality alone is insufficient for measuring performance, as it does not consider the cost of failing to retrieve relevant CRISPRs or the cost of mistakenly retrieving instances that are not CRISPRs. For evaluating inconsistencies of these types, precision and recall are used.
Precision and recall are measures commonly used in the field of information retrieval (IR) when evaluating search algorithms. Their definitions are based on true positives, false positives and false negatives. The descriptions given here are expressed in terms of the tools evaluated in this study.
True positive (TP): the number of instances retrieved that were CRISPRs, False positive (FP): the number of instances retrieved that were not CRISPRs, False negative (FN): the number of instances not retrieved that were CRISPRs.
Determining FN can be problematic because it requires the total number of CRISPRs in the dataset to be known. As is often done in IR, in this study FN is estimated using the composite result sets from all of the available searches tools.
Using the definitions above, precision (p) and recall (r) can be computed as follows.
Precision is the ratio of the number of instances correctly identified to all the instances retrieved. Given an instance from the result set, it represents the likelihood of that instance being a CRISPR. Thus, precision can be used to answer the question, "Did the retrieval system identify a lot of junk (or instances that were not CRISPRs)?"
Recall is the ratio of the number of instances correctly identified to the total number of instances that are CRISPRs (whether retrieved or missed). Thus, recall can be used to answer the question, "Were all of the CRISPRs retrieved?"
In [
14], Godde and Bickerton documented CRISPRs in 101 species with the use of Patscan. From that set, a random sample of size 27 was selected for comparison with results from CRT and Pilercr (using default parameter settings). Between Patscan, CRT and Pilercr, a total of 83 distinct CRISPRs were identified. Using the collective information, quality, precision and recall were computed for each tool. The results are presented in Table under the heading CRISPRs with Cas genes. Note that precision is not applicable for Patscan, because false positives are removed during manual post-processing. Also, the results for CRT are based on a search window length of 8. A search window length of 6 would produce similar precision/recall results, but would have a slightly lower quality score, because the likelihood of Type III inaccuracies is slightly increased.
| Table 1Performance evaluation measures for the examined tools. |
The high scores for CRT and Pilercr show that automatic detection of CRISPRs can be very reliable, even more so than with the use of manual post-processing as is done with Patscan. However, it is not clear whether the lower scores for Patscan were mostly from the human involvement in the detection process or from the Patscan algorithm.
The quality score was highest for CRT. The lower score for Patscan was due entirely to Type I inconsistencies. The categories of inconsistencies for Pilercr were evenly spread, with Type I and Type II inconsistencies usually missing by only small amounts. Precision was highest with Pilercr, while CRT had the best recall score. In this application of precision/recall, recall is more significant as it gives an indication of the number of CRISPRs that were missed by a search tool. Although precision is important, a more sensitive tool that detects most CRISPRs but also reports a few repetitive sequences that are not really CRISPRs is more desirable than a less sensitive tool that misses several CRISPRs but reports very few false positives.
As mentioned above, in order to include Patscan in retrieval evaluations, results were used from Godde and Bickerton. However, they reported CRISPRs only for species that had CRISPR-associated (Cas) genes [
10]. The authors of this study suspect that CRISPRs with Cas genes may have fewer mutations, thus they are easier for search tools to detect. As a result, the tools have higher evaluation scores. For this reason, a second experiment was undertaken using 80 randomly selected finished genomes from the IMG version 1.5 database. Using CRT and Pilercr, a total of 51 distinct CRISPR elements were identified within the 80 genomes. The evaluation scores are shown in Table under the heading CRISPRs with/without Cas genes. These results should be more reflective of the performance of the tools for a typical search. Almost all measures show a reduction in performance. The most noticeable difference is a decrease in precision for CRT and a decrease in recall for Pilercr.