|Home | About | Journals | Submit | Contact Us | Français|
We present GuideScan software for the design of CRISPR guide RNA libraries that can be used to edit coding and noncoding genomic regions. GuideScan produces high-density sets of gRNAs for single- and paired-gRNA genome-wide screens. We also show that by using a trie data structure GuideScan designs gRNAs that are more specific than those designed by existing tools.
CRISPR-Cas9 genome editing1, 2 can be used for both straightforward disruption of single protein coding genes and genome-wide loss-of-function screens3–6. However, application of CRISPR-Cas9 genome editing to the non-coding genome remains limited
Disruption of noncoding RNA or DNA elements using CRISPR-Cas9 often requires the concomitant expression of two gRNAs to engineer deletions 8, 10. Although progress has been made towards making the construction of paired-gRNA vectors scalable11, the absence of tools to design paired-gRNA libraries has hampered progress towards genome-wide non-coding genetic screens. To overcome this limitation we developed GuideScan, an open-source software package that allows users to construct comprehensive and fully customizable gRNA databases for any genome or CRISPR endonuclease and design paired- and single-gRNA libraries (Fig 1a).
The first step in building a gRNA database is identifying the targetable genome space. To accommodate different CRISPR endonucleases with distinct specificity requirements, GuideScan allows the user to define target sequences by setting the values of three parameters: the protospacer associated motif (PAM), the PAM position relative to the gRNA binding sequence, and gRNA length12 (Fig. 1a, middle panel). Non-canonical PAMs can also be specified, since they can be recognized and cleaved by CRISPR proteins with some efficiency and thus may contribute to off-target cutting.
Next, gRNA sequences that can lead to cleavage of multiple genomic loci need to be identified and removed. Because CRISPR endonucleases can tolerate single mismatches in gRNA-DNA pairing13, gRNAs that have less than 2 mismatches to off-target loci are typically avoided14. Third party alignment tools are often used to identify potential off-target loci in the genome, but we and others12 have found that they are unable to return all mismatch neighbors (data not shown) and thus consistently underestimate the number of potential off-targets for a given gRNA15.
Rather than using an alignment tool, GuideScan uses a retrieval tree (trie) data structure, which efficiently and precisely enumerates all targetable sequences present in a given genome (Fig. 1a, right panel). Traversals of the trie allow for the computation of sequence mismatch neighborhoods, which are used to construct databases of gRNAs whose target sites are unique in the genome up to a user-defined number of mismatches (M). Additionally, for gRNAs in the database, more degenerate target sequences – up to Q (Q > M) mismatches – can be correctly enumerated, and stored in the database as potential off-target loci. Each gRNA can be further annotated with additional information, including on-target efficiency scores13 and the genomic feature(s) it overlaps with (e.g. exon, intron, intergenic region). Once constructed, the database allows efficient individual or batch queries of genomic coordinates for single or paired gRNA designs.
To validate GuideScan, we generated a murine Cas9 database with M=1. This database contains over 1.7×108 gRNAs, with an average distance of 15.5 nucleotides between gRNAs designed over autosomes, and is two orders of magnitude larger than the only other available genome-wide gRNA database, provided by mit.edu as a UCSC genome browser track (Supplementary Figure 1a and Supplementary Note).
To test if using Guidescan translated into a better practical outcome in terms of designing gRNAs for precise genomic deletions in the noncoding genome, we took the coordinates of mouse CTCF binding sites, enhancers, miRNAs, and lncRNAs and used GuideScan or the mit.edu genome-wide UCSC track gRNA database to design pairs of gRNAs against each target site. We found that the combined median distances of the GuideScan gRNA pairs to the feature boundaries were 29, 31, 24, and 27 base pairs (bp) respectively, compared with 781, 783, 716 and 774 bp when using gRNAs from the mit.edu genome-wide track (Fig. 1b). Thus, databases generated by our tool are suitable for the engineering of precise genomic deletions and the generation of loss-of-function alleles for noncoding regulatory elements (Figure 1c, d).
To further benchmark GuideScan, we compared it to three widely used gRNA design tools: the mit.edu web interface14 (see Supplementary Note), CRISPRscan10, and E-CRISP16. Unlike GuideScan these tools do not provide direct access to the underlying databases, so we queried 150 regions in the mouse genome overlapping randomly selected protein-coding genes, noncoding elements, and repetitive regions. All tools returned distinct but overlapping sets of gRNAs (Fig. 2a and Supp. Table 1), and surprisingly all except for GuideScan returned a substantial fraction of guides having more than one perfect target site—hereafter referred to as ‘perfect off-targets’ (Fig. 2a, Supp. Table 2). In the most extreme cases, individual gRNAs had more than 30,000 perfect off-targets (Fig. 2b) that were missed by the corresponding design tool (Supp. Table 1) and consequently were not considered when calculating the gRNA’s specificity score. In fact, gRNAs with more than one perfect target site were roughly equally distributed between low (26%), medium (37%), and high (37%) specificity score categories according to the output of the mit.edu web interface11 (Supp. Fig. 1b).
In addition, the competitor tools that we tested underreported the number of off-target sites with single mismatches to the gRNA (Supp. Table 1). Based on these observations we predicted that gRNAs returned by GuideScan should have, on average, greater specificity than gRNAs reported by the other tools. To test this prediction, we used GuideScan’s trie function to enumerate all potential off-target loci (Q=3) for the gRNAs designed by all tools, and calculated the likelihood of cleavage at these sites using a recently reported metric that takes into account the number, position, and nature of mismatches13. We then used these values to calculate the aggregate specificity score for each gRNA as previously described14. We found that gRNAs designed by GuideScan had on average significantly higher specificity scores than those returned by the mit.edu web interface (p<2.2×10−16; D=0.22; Z=7.38) or by CRISPRScan (p<2.2×10−16; D=0.25; Z=6.59) (Fig. 2c). E-CRISP and GuideScan had similar distribution of specificity scores, but E-CRISP returned an order of magnitude fewer gRNAs, some of which had multiple perfect or near-perfect matches in the genome (Fig. 2a, 2b and Supp. Table 2).
Failure to discard gRNAs with multiple perfect target sites has important implications in the design and interpretation of gene editing experiments. The number of DSBs induced by a gRNA in a single cell correlates well with gene-independent gRNA depletion, and is a major source of noise in negative selection screens17. Thus, the correct identification and filtering of promiscuous gRNAs is crucial for designing effective CRISPR libraries. Furthermore, unknowingly using gRNAs with multiple perfect target sites can result in highly efficient gene editing at undesired sites (Fig. 2d) as well as the generation of chromosomal rearrangements18 such as translocations (Fig. 2e) and deletions (Fig. 2f) that include the desired cleavage site and the unknown additional sites.
To facilitate the construction of single and paired gRNA libraries, we have released a web interface that includes access to pre-compiled genome-wide Cas9 and Cpf1 gRNAs databases for many model organisms (www.guidescan.com). This website allows users to input coordinates of genomic features in batch, to choose between designing single internal gRNAs or pairs of flanking gRNAs, and retrieve for each genomic coordinate a pre-defined number of gRNAs or gRNA pairs.
Collectively, these data show that GuideScan is a substantial improvement compared with existing gRNA design tools. We expect that this tool will facilitate ongoing efforts aimed at deducing functions of coding and the noncoding parts of genomes.
To generate a customized genome-wide gRNA database for a given CRISPR system the user supplies the target genome as a FASTA file and specifies parameters such as desired gRNA length, PAM position with respect to gRNA, canonical and alternative PAM sequences, Hamming distance (M) for which gRNAs are required to have a unique target site in the genome, and Hamming distance (Q) for which potential off-targets will be enumerated. Like other methods, the algorithm scans a genome for canonical and alternative PAM sequences and identifies all k-mers associated with them21. The full universe of these k-mers and PAM sequences with their coordinates are written out to a temporary file along with a count of how often a particular k-mer occurs in the genome. At the next stage, the algorithm analyzes the list of k-mers and determines which of these k-mers constitute potential gRNAs. A trie of all k-mers is constructed. The trie stores a user-defined maximum number of k-mer coordinates. If and only if a k-mer occurs with a canonical PAM uniquely in the genome then it is initially labeled as a candidate gRNA. However, if the k-mer occurs more than once in the genome or occurs with an alternative PAM then it is labeled as a non-candidate gRNA. These non-candidate gRNAs are written out to ‘blacklist’ files which define why a k-mer was rejected as a candidate gRNA. Furthermore, the algorithm ensures that each gRNA has a unique target in the genome up to M mismatches, thereby forcing all candidate gRNAs to be distinct from one another by at least M mismatches. This task is accomplished through trie traversals, where for each candidate gRNA sequence a mismatch neighborhood (which in this case is the set of inexact matches to the candidate gRNA up to M mismatches) is assessed. If even one mismatch neighbor sequence is found for a given candidate gRNA, then this gRNA is not unique in the genome up to M mismatches and is therefore relabeled as a non-candidate gRNA and written to the ‘blacklist’ file. The gRNAs that are unique in the genome up to M mismatches can have off-targets completely enumerated up to Q (where Q > M) mismatches. Off-target information for candidate gRNAs is again determined by trie traversals and computing the mismatch neighborhood for a given gRNA sequence. By construction, the gRNAs undergoing off-target enumeration have no off-targets within M mismatches, and so if an inexact match to the gRNA sequence with P mismatches is found, such that M < P ≤ Q, then the number of times this off-target sequence occurs, its Hamming distance P, and coordinates for a user-defined number of off-target are recorded. Candidate gRNAs are written from the trie to a file in sequence alignment map (SAM) format19. This file has the gRNA sequence as a unique identifier as well as off-target information for the gRNA stored in a hex-byte array. SAM tags record a maximum count of off-targets, the distance for which off-targets were searched, and the hex-byte array of off-target sequences and their coordinates. This SAM file is then converted to a BAM file and indexed so that it can be quickly accessed using Samtools.
The guideRNA database can be enriched with additional information such as on-target cutting efficiency scores. We adopt the scores defined by Doench et al. Rule Set 213 Scores are computed for each gRNA and added to the previously constructed database. This rule set predicts on-target gRNA cutting efficiency using a learned boosted regression tree model. The input for this model requires a 30mer sequence, only assesses cutting efficiency for 20mer gRNAs, and only predicts cutting efficiency for gRNAs with the canonical Cas9 PAM sequence (NGG).
A database BAM file, composed of Cas9 gRNAs with 20mer complementary region and NGG PAM, can be supplemented to include gRNA specificity scores. Specificity scores for gRNAs are based on the Doench et al. CFD model13, which computes the likelihood of a gRNA cutting at each individual off-target site using an experimentally derived mutation matrix. Specifically, for a given gRNA GuideScan enumerates all its neighbors up to Q mismatches, calculates the CFD score for each neighbor, and then multiplies that score by the number of times the neighbor occurs in the genome. It then aggregates the CFD values into a single composite score utilizing the formula used by Hsu et al.14:
Here, n represents the number of unique targetable sites within up to z mismatches. The desired on-target site (z = 0) is included in this computation and will give a CFD value of one. The value qi represents the number of times the ith neighbor occurs in the genome. For a unique target site up to z mismatches, the specificity score would be 1 since CFD = i = n = qi = 1. The resulting composite specificity score is written out to a text file along with information such as the target sequence and target coordinates. These scores are then added to the BAM file as a new SAM tag. Importantly, because the GuideScan algorithm is general and allows the construction of gRNA databases for distinct enzymes and distinct gRNA lengths, specificity scores are not automatically computed. If a user chooses to compute these scores, GuideScan first determines whether the database conforms to the parameters required for CFD scoring, and if so computes the aggregate specificity score for each gRNA in that database. For the specificity scores in the pre-computed Cas9 databases on the GuideScan web-interface, the parameters z = Q = 3 were used.
The algorithm allows for the direct query of the BAM file database using PySam (python interface to samtools; https://github.com/pysam-developers/pysam) and allows for the lookup of gRNAs by genomic coordinates. The output contains the gRNAs in the queried region, off-target information, predicted Rule Set 2 cutting efficiency score if appropriate, as well as the exon annotation of on-target gRNA and off-target cut sites. The exon annotation relies on the creation of an interval tree constructed from a BED file containing genome-wide exon coordinates. Consequently, the exon annotation for gRNAs is done at the time of database query.
The code is freely available at guidescan.com, in Supplementary Code and at bitbucket (bitbucket.org/arp2012/crispr-project/overview). Version v0.0.4 of the code was used to generate the data presented in this manuscript.
The algorithm was run on a select set of model organisms, as well as on human genome, to produce gRNA databases where all constituent gRNAs are unique in the genome up to two mismatches. These databases are accessible through a web interface that allows for batch queries by genomic coordinates. The website allows the user to find gRNAs within a genomic region or flanking a genomic region and allows the output to be sorted according to either number of off-targets, distance to target site, or cutting efficiency score. The website utilizes the database query ability of our algorithm to rapidly report gRNAs for user defined regions and achieves its functionality through CherryPy, a python web framework (http://www.cherrypy.org/).
The comparisons shown in Figure 1b were done using the coordinates mouse CTCF binding sites, enhancers, miRNAs, and lncRNAs retrieved from the following publications 21–24. The outputs of GuideScan, crispr.mit.edu web portal14, CRISPRscan10, and E-CRISP16 were compared on sequences overlapping randomly chosen protein coding genes, noncoding genomic elements, and repeat masked regions. We limited our test to a total of 150 sequences (50 protein coding genes, 50 noncoding elements, 50 repeat masked regions) from mm10 (the most recent assembly of M. musculus genome) because the input limits (file size/number of sequences) associated with some of these tools made larger scale comparisons impractical. Similarly, we limited the size of the test sequences to 150 base pairs due to input limits of some of the tools. The genomic coordinates of the sequences used in these comparisons are provided in Supplementary Table 3. Importantly, the CRISPRscan tool provides both 19mer and 20mer gRNA designs. For the purpose of this experiment we limited our analysis to 20mers because they could be directly compared to the gRNA designs provided by the remaining softwares. The number and type of off-targets for each gRNA was determined by querying GuideScan’s software package intermediate kmers file and independently through a R query of BSgenome mm10 version.
A one-sided Kolmogorov-Smirnov test was used to compare the specificity scores of 1839 GuideScan gRNAs against 267 E-CRISP gRNAs, 1189 CRISPRScan gRNAs, and 2641 MIT gRNAs; P values were computed using the ks.test function in R. Additionally, the effect sizes (Z) of these comparisons were reported using the formula
where D is the Kolmogorov-Smirnov statistic equaling the maximum difference between empirical CDF functions of the two samples in the comparison, and n1 and n2 are the sample sizes.
Paired-gRNA vectors to generate deletions of DNA and RNA noncoding elements were cloned as previously described11. Briefly, s/mU6 oligos carrying the sequence corresponding to 2 gRNAs for each locus (miR-290~295: gRNA1, TAGTACATCGGTCTAACTCA; gRNA2, GTTGAGACTAAAGGTAATCC. Enhancer element: gRNA1, AGCTACCCCGTAACCAAGTG; gRNA2, AAGGCCATATAGTTGTCGCC) were PCR amplified and cloned into BsmBI digested lentiCRISPRv1 vector (addgene #49535) using pDonor_sU6 intermediate vector (addgene #69351). Single gRNAs were cloned into BsmBI digested lentiCRISPRv1 vector using standard oligo cloning protocols.
The V6.5 Mouse Embryonic Stem cells (obtained from Rudolf Jaenisch) were tested for germline transmission and for the absence of mycoplasma infection. These cells were grown on a monolayer of irradiated mouse embryo fibroblasts at 37°C (5% CO2) in KnockOut DMEM media (Gibco) supplemented with 15% FCS, L-glutamine (2mM), penicillin (100U ml^−1), streptomycin (100μg ml^−1), and LIF (103 U ml^−1). To generate genomic deletions, paired-gRNA vectors were transfected using Lipofectamine 2000 (Invitrogen) according to manufacturer’s protocols. Cells were collected in lysis buffer (100 mM Tris-HCl pH8.5, 200 mM NaCl, 5 mM EDTA, 0.2% SDS and 100 ng ml^−1 proteinase K) 6 days following transfection, and genomic DNA extracted using phenol-chloroform followed by ethanol precipitation. Genomic deletions were detected by PCR using primers flanking the gRNA cut sites (miR_fwd: AGGGAGGAACGAGCCTATGT, miR_rev: GCATGCCTAAATCCCAAGAG; enh_fwd: GTGGCTCAGTGTTTCCCATT, enh_rev: CAGGCAAACTCTCCCATGTT). PCR bands corresponding to the genomic deletions were cloned into the Topo Blunt II vector (Invitrogen) and plasmid DNA from individual bacterial clones subject to Sanger sequencing.
To test cleavage and rearrangements produced by single gRNAs, 293T cells (obtained from ATCC, cultured under standard conditions, and tested negative for mycoplasma contamination) were plated on 12-well-plates (Corning). Constructs expressing individual gRNAs were transfected into cells the following day using lipofectamine 2000 (Invitrogen) and transfected cells selected with puromycin (4 μg ml^−1) for 2 days. Four days following transfection, genomic DNA was extracted as above and used to determine total cleavage and generation of genomic rearrangements. Total editing at perfect target sites was determined using a T7 endonuclease assay. Briefly, target sites in a pool of transfected cells were amplified by PCR, and 200ng of resulting amplicon were used to generate DNA heteroduplexes. The resulting molecules were incubated with 10U of T7 endonuclease (NEB) for 15 minutes and the product of the digestion run on a 2% agarose gel. Total editing estimates were calculated as previously described25. Sequences or gRNAs and primers used in these experiments are shown in Supplementary Table 4.
We thank members of the Ventura and the Leslie laboratories for comments and suggestions. We thank Lauren Fairchild and Rafi Pelossof for providing source code for the SplashRNA web server to serve as the backbone for the GuideScan website. This work was supported in part by NIH: grants P30-CA008748 (MSK Core), U01-HG007033 (CSL), U01-HG007893 (CSL), and by grants from the Geoffrey Beene Cancer Research Foundation (AV), the Uniting Against Lung Cancer Foundation (AV), the Cycle for Survival Foundation (AV), the Pershing Square Sohn Cancer Research Alliance (AV), and the Lung Cancer Research Foundation (JAV). The GuideScan source code and all associated documentation are deposited at guidescan.com.
Author ContributionsJ.A.V., C.S.L., and A.V. conceived and supervised the project. Y.P. and A.R.P. developed the GuideScan algorithm with input from C.S.L.; A.R.P and Y.P. implemented the GuideScan software package; A.R.P. performed the computational experiments; J.A.V. performed the wet-lab experiments; A.R.P and S.C. implemented the web-server; L.Z. provided expertise in software development and helped improve the website user experience; J.A.V drafted the manuscript with contributions from all authors.
Competing Financial Interests
The authors declare no competing financial interests.