We sought to improve the original strategy of Liu and colleagues by addressing its inability to interrogate cleavage site libraries in vitro to a depth sufficient to identify all possible off-target sites present in the human genome. To do this, we added a machine-learning-based step that uses cleavage site preferences from the in vitro selection experiments to predict which sequences in the human genome are most likely to be cleaved (Figure 1). We used standard machine-learning techniques to construct Naïve Bayes classifiers that quantify how the nucleotide identity at each position within a DNA site differs between members of a partially degenerate library that were cleaved efficiently in vitro and those that were not (‘Materials and Methods’ section). The scores generated by each classifier range from 0 to 1, with lower scores representing a higher probability that any given site will be cleaved (‘Materials and Methods’ section).
Figure 1. Schematic illustrating the original method by Pattanayak et al. (8) (blue arrows) and the enhanced approach that incorporates addition of a classifier-based step (green arrows).
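The classifier step described above can be illustrated with a minimal sketch. This is not the implementation from the ‘Materials and Methods’ section: the toy sequences, the additive pseudocount, the equal class priors and the function names are all assumptions made for illustration. The score is taken here to be the posterior probability of the "not cleaved" class, so that lower scores correspond to sites more likely to be cleaved, matching the convention stated in the text.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Estimate per-position nucleotide frequencies with additive smoothing.

    The pseudocount is an assumption of this sketch; it keeps unseen
    nucleotides from receiving zero probability.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    table = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        table.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return table

def log_likelihood(site, table):
    """Naive Bayes assumption: positions contribute independently."""
    return sum(math.log(table[pos][nt]) for pos, nt in enumerate(site))

def classifier_score(site, cleaved_table, uncleaved_table, prior_cleaved=0.5):
    """Posterior probability of the 'not cleaved' class (0 to 1).

    Lower values mean the site looks more like the cleaved library members.
    """
    log_c = math.log(prior_cleaved) + log_likelihood(site, cleaved_table)
    log_u = math.log(1.0 - prior_cleaved) + log_likelihood(site, uncleaved_table)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_c, log_u)
    p_c = math.exp(log_c - m)
    p_u = math.exp(log_u - m)
    return p_u / (p_c + p_u)

# Hypothetical toy libraries (real selections contain far more, and longer, sites).
cleaved = ["ACGT", "ACGA", "ACGT", "ACTT"]
uncleaved = ["TGCA", "GGCA", "TGAA", "CGCA"]
cleaved_table = position_frequencies(cleaved)
uncleaved_table = position_frequencies(uncleaved)

for candidate in ("ACGT", "TGCA"):
    print(candidate, classifier_score(candidate, cleaved_table, uncleaved_table))
```

In this sketch, a candidate resembling the cleaved library scores close to 0 and one resembling the uncleaved library scores close to 1.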
We performed an initial test of our approach by developing a classifier based on in vitro site selection data previously obtained for ZFNs targeted to a site in the human CCR5 gene. As shown in Supplementary Table S1, application of this CCR5 ZFN classifier to the human genome resulted in the overwhelming majority of potential target sites having a high classifier score: 11 421 321 184 of 11 421 337 066 potential sites (99.999861%) received a score higher than 0.75. By contrast, only 15 882 sites (0.000139% of all potential sites) had a score lower than 0.75, and only 1123 sites (0.00000983% of all potential sites) had a score below 0.5. Importantly, all 12 bona fide off-target sites identified previously by the in vitro cleavage site selection and the IDLV integration methods had scores below 0.75. In addition, 11 of these 12 sites fall within the top 25% of sites with scores below 0.75 (Supplementary Table S2).
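The proportions quoted above follow directly from the site counts. The short calculation below reproduces them from the three counts given in the text. The variable names are illustrative only.

```python
# Counts taken from the text for the CCR5 ZFN classifier.
TOTAL_SITES = 11_421_337_066   # all potential sites scored in the human genome
BELOW_075 = 15_882             # sites with a score lower than 0.75
BELOW_050 = 1_123              # sites with a score below 0.5

def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole

at_or_above_075 = TOTAL_SITES - BELOW_075
pct_above_075 = percent(at_or_above_075, TOTAL_SITES)
pct_below_075 = percent(BELOW_075, TOTAL_SITES)
pct_below_050 = percent(BELOW_050, TOTAL_SITES)

print(f"{at_or_above_075:,} sites ({pct_above_075:.6f}%) at or above 0.75")
print(f"{BELOW_075:,} sites ({pct_below_075:.6f}%) below 0.75")
print(f"{BELOW_050:,} sites ({pct_below_050:.8f}%) below 0.5")
```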
Having established classifier score cutoffs that enable identification of all previously known off-target sites for the CCR5-targeted ZFNs, we next prospectively tested whether other sites with scores below 0.75 might include additional bona fide off-target sites. However, a comprehensive analysis of all sites with scores below 0.75 would require deep sequencing of 15 882 different alleles, an experiment that would be challenging and expensive to perform, given the current cost of next-generation sequencing. Therefore, we instead systematically assessed a smaller sampling of sites by first grouping them based on their position in exonic or non-exonic genomic sequence and then binning sites within each of these groups according to their classifier scores (i.e., 0.0 to 0.1, 0.1 to 0.2, etc.). To achieve high levels of nuclease activity that would facilitate detection of lower frequency off-target events, we used conditions described by Liu and colleagues to overexpress CCR5-targeted ZFNs in K562 cells (‘Materials and Methods’ section). We then used deep sequencing to assess the top 13 scoring sites (if available) within each bin for evidence of NHEJ-mediated indel mutations in the genomic DNA of these cells.
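The stratified sampling described above might be sketched as follows. The input format (tuples of site identifier, score and exonic flag), the function name and the toy data are hypothetical. The sketch also assumes that the "top 13 scoring sites" in a bin are the 13 with the lowest scores, since lower scores indicate a higher predicted likelihood of cleavage. The text does not state this explicitly.

```python
import math
from collections import defaultdict

def stratified_sample(sites, per_bin=13, bin_width=0.1, cutoff=0.75):
    """Select up to `per_bin` sites from each (exonic?, score-bin) stratum.

    sites: iterable of (site_id, score, is_exonic) tuples (hypothetical format).
    Returns a dict mapping (is_exonic, bin_index) to a list of site_ids,
    ordered from lowest to highest score.
    """
    strata = defaultdict(list)
    for site_id, score, is_exonic in sites:
        if score >= cutoff:
            continue  # only sites below the cutoff are candidates
        # Round before flooring so that, e.g., 0.3 / 0.1 lands in bin 3.
        bin_index = math.floor(round(score / bin_width, 9))
        strata[(is_exonic, bin_index)].append((score, site_id))

    selected = {}
    for key, members in strata.items():
        members.sort()  # ascending score: most likely cleaved first (assumption)
        selected[key] = [site_id for _, site_id in members[:per_bin]]
    return selected

# Hypothetical toy input: 20 non-exonic sites in the 0.0 to 0.1 bin,
# two exonic sites in the 0.2 to 0.3 bin, and one site above the cutoff.
toy_sites = [(f"n{i}", 0.001 * i, False) for i in range(20)]
toy_sites += [("e1", 0.25, True), ("e2", 0.21, True), ("x", 0.80, False)]

selected = stratified_sample(toy_sites)
for key in sorted(selected):
    print(key, selected[key])
```

Bins with fewer than 13 members simply contribute all of their sites, which corresponds to the "(if available)" qualifier in the text.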
Analysis of 138 sites identified NHEJ-mediated indel mutations not only at the intended CCR5 target site and at a previously known off-target site in the CCR2 gene but also at 21 new off-target sites. As expected, the percentage of bona fide off-target sites found within each classifier score bin was inversely correlated with the magnitude of the score (i.e., a greater percentage of actual off-target sites were identified in the lower score bins). For example, 35% (16 of 46) of the screened targets with scores in the first tercile (lowest scores) showed significant evidence of NHEJ-mediated indel mutations compared with 13% (6 of 46) and 2% (1 of 46) of sites with scores in the second and third terciles, respectively (Supplementary Table S3).
Off-target sites for ZFNs targeted to CCR5 displaying significant evidence of ZFN-induced indels grouped by classifier probability score
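As a consistency check on the tercile figures reported for the CCR5 ZFNs, the calculation below recomputes the per-tercile hit rates from the counts given in the text and confirms that the totals agree with the 138 screened sites and the 23 sites showing indels (the on-target site, the known CCR2 site and 21 new sites). Only counts stated in the text are used. The variable names are illustrative.

```python
SCREENED_PER_TERCILE = 46
HITS_BY_TERCILE = {
    "first (lowest scores)": 16,
    "second": 6,
    "third": 1,
}

total_screened = SCREENED_PER_TERCILE * len(HITS_BY_TERCILE)
total_hits = sum(HITS_BY_TERCILE.values())

# Percentage of screened sites in each tercile with significant indel evidence,
# rounded to the nearest whole percent as in the text.
hit_rate_percent = {
    name: round(100 * hits / SCREENED_PER_TERCILE)
    for name, hits in HITS_BY_TERCILE.items()
}

rates = list(hit_rate_percent.values())
# True when the hit rate falls from each tercile to the next.
decreasing = all(a > b for a, b in zip(rates, rates[1:]))

print(total_screened, total_hits, hit_rate_percent, decreasing)
```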
To test the generalizability of our classifier-based approach, we used it to predict off-target sites for another pair of ZFNs targeted to the human VEGFA locus (Supplementary Table S4). Previous work using the in vitro cleavage site selection assay had identified a large number of potential off-target sites for this ZFN pair in human cells (Supplementary Table S5). We used these selection data to build a classifier, which we then used to score every possible site in the human genome (‘Materials and Methods’ section). As we observed with the CCR5 classifier, only a small number (7242) of genomic sites had a classifier score below 0.75, and only 936 sites had a score below 0.5. In addition, all 31 bona fide off-target sites identified previously with the in vitro selection data had scores below 0.6, with all but one of these sites having scores below 0.5 (Supplementary Table S6). We assessed 159 potential off-target sites (identified using the same stratified sampling approach we used for the CCR5 ZFNs) for evidence of off-target mutations from genomic DNA of human cells in which the VEGFA-targeted nucleases had been expressed. This systematic stratified analysis identified 34 bona fide off-target sites, including eight that were previously identified by Pattanayak et al. (8) and 26 that were novel. We note that the majority of these novel off-target sites had low classifier scores, again demonstrating the predictive capability of our method. Furthermore, several of the sites we predicted to be off-target sites that did not show a statistically significant level of NHEJ mutations in this study had been previously confirmed as off-targets when screened with a greater depth of sequencing reads by Pattanayak et al. (8), suggesting that a greater number of the predicted off-target sites might show evidence of mutation with deeper sequence sampling.
Off-target sites for ZFNs targeted to VEGFA displaying significant evidence of ZFN-induced NHEJ grouped by classifier probability score