|Home | About | Journals | Submit | Contact Us | Français|
Analysis of sites of newly integrated DNA in cellular genomes is important to several fields, but methods for analyzing and visualizing these datasets are still under development. Here, we describe tools for data analysis and visualization that take as input integration site data from our INSPIIRED pipeline. Paired-end sequencing allows inference of the numbers of transduced cells as well as the distributions of integration sites in target genomes. We present interactive heatmaps that allow comparison of distributions of integration sites to genomic features and that support numerous user-defined statistical tests. To summarize integration site data from human gene therapy samples, we developed a reproducible report format that catalogs sample population structure, longitudinal dynamics, and integration frequency near cancer-associated genes. We also introduce a novel summary statistic, the UC50 (unique cell progenitors contributing the most expanded 50% of progeny cell clones), which provides a single number summarizing possible clonal expansion. Using these tools, we characterize ongoing longitudinal characterization of a patient from the first trial to treat severe combined immunodeficiency-X1 (SCID-X1), showing successful reconstitution for 15 years accompanied by persistence of a cell clone with an integration site near the cancer-associated gene CCND2. Software is available at https://github.com/BushmanLab/INSPIIRED.
Retroviruses, transposons, and other mobile DNA elements directly integrate their DNA into the chromosomes of host cells.1, 2, 3, 4, 5, 6, 7, 8 Distributions of newly integrated DNA elements can be characterized using next-generation sequencing, as is described in the companion paper in this issue of MolecularTherapy: Methods & Clinical Development9 and in many previous studies (e.g., Bushman,1 Craig et al.,2 Schröder et al.,3 Mitchell et al.,4 Maldarelli et al.,6 Cohn et al.,8 Wu et al.,10 Hoffmann et al.,11 and Biffi et al.12). Contemporary methods take advantage of Illumina paired-end sequencing to report the location of newly integrated DNA.13, 14, 15, 16 Multiple reports have described methods for quantifying and analyzing such data, but optimal methods for statistical analysis, data reduction, and data visualization are the topics of ongoing development.13, 14, 16, 17, 18, 19 Here, we describe a suite of tools for integration site analysis and visualization that are useful for characterizing samples from human gene therapy and other applications.
In the case of gene modification in circulating blood cells, it is possible to sample cell populations from blood longitudinally and sequence sites of integration of the gene-correcting vector.20, 21, 22, 23, 24, 25, 26, 27 An important question centers on how best to quantify the numbers and types of gene-modified cell clones contributing blood cells to the periphery. For example, adverse events have been reported where expanded cell clones in blood became frank leukemia,28, 29, 30, 31 and this can be tracked using quantitative integration site data. Complicating the analysis, simply counting the number of integration site sequence reads does not accurately report clonal abundance because of distortions resulting from PCR steps in the integration site recovery procedure.32, 33
We have previously described a method for abundance estimation based on paired-end sequencing of PCR products containing integration site sequences that allows for accurate quantification of gene-modified cells.34 DNA is sheared using sonication, DNA linkers are ligated to free DNA ends, and then samples are amplified using primers complementary to the integrated vectors and ligated linkers. Genomic sequence information is acquired from both the linker end and the integrated vector end. For the case of an expanded clone, many different DNA breaks and sites of linker ligation are associated with the unique integration site from the expanded clone. This allows for the estimation of abundance using the number of different linker ligation sites as a surrogate for the number of cells sampled. We have published statistical tools for analysis of such data and applied these tools to track several gene therapy trials.20, 21, 25, 26, 34, 35, 36, 37, 38, 39 Other groups have also used related methods.6, 8, 24, 40, 41, 42, 43, 44
Here, we present tools for integration site sequence analysis and the quantification of clonal abundance, and describe applications in human gene therapy. These methods can also be used for tracking latently infected cells in HIV-positive subjects and monitoring experiments using insertional mutagens, and in mechanistic studies of DNA integration. We describe a heatmap format for the analysis of relationships among integration site distributions, genomic features, and sites of epigenetic modification. These analyses allow users to carry out numerous custom statistical comparisons with annotations, random distributions, and other datasets by simply pointing and clicking. We also present a series of analytical tools for use with patient samples to characterize integration site population structure and possible adverse events. Results are packaged into reproducible reports (html or pdf file format), allowing for version tracking of the code, datasets used, and external datasets queried. Using these tools, we describe examples of tracking a subject from the first gene therapy trial to treat severe combined immunodeficiency-X1 (SCID-X1) deficiency. The data demonstrate durable reconstitution accompanied by a clonal expansion of cells harboring an integrated vector near the cancer-associated gene CCND2.25
The INSPIIRED pipeline is summarized in Figure 1. The first steps involve the generation of a sequence library and sequence data acquisition, genomic alignment, analysis of viral integration sites in repeated sequences, and quality controls (described in the accompanying paper9). The intSiteCaller program takes FASTQ files as input. After sequence quality filtering, trimming of DNA sequences added during library construction, and sequence alignment, unique sites and integration sites in repeated sequences (“multihits”) are saved in an intermediary binary format (RData file format). All sites from each run of the sequencing instrument are uploaded into a MySQL database (IntSiteDB) for storage and for use in downstream analysis using a utility script (intSiteUploader). Alternatively, the INSPIIRED pipeline also supports use of a SQLite database, which is provided with the software. The IntSiteDB stores genomic locations of integration sites together with PCR break points and their counts, which is used for estimation of abundance.34 All downstream analyses are carried out using genomic locations and sonic break points. The IntSiteDB schema is shown in Figure S1.
We use intSiteRetreiver, a key component of the INSPIIRED software package, to retrieve unique sites and multihits for chosen samples from the IntSiteDB for analysis. The database organization allows the combined analysis of multiple samples from different instrument runs. Integration sites are annotated with the hiAnnotator R software package (http://bioconductor.org/packages/release/bioc/html/hiAnnotator.html), which makes use of genomic features compiled by the UCSC Genome Bioinformatics Group.
For studies of gene therapy subjects, patient metadata and specimen information are stored in an accompanying gene therapy specimen management database that includes anonymized patient identifiers, cell types analyzed, and time point data. These features are then used in the gene therapy patient reports described below. The separation of the pipeline into several output products (patient report and heatmaps) and databases (integration site, patient metadata, and annotation) provides flexibility in development and use.
We use heatmaps to summarize the relationships between integration site distributions and genomic annotations (Figures 2, ,3,3, and and4).4). These maps were introduced in Berry et al.,35 which presents more background and examples of their uses. The heatmaps summarize information on integration site datasets in columns and different genomic features in rows. For each comparison, integration site distributions are compared with distributions of randomly selected sites in the human genome.4, 35 Early integration site recovery methods involved use of cleavage by restriction enzymes, which are unevenly distributed in the human genome.32, 33 For this reason, random controls were matched based on proximity to restriction enzyme cleavage sites.35 The INSPIIRED pipeline uses random cleavage by sonication, so purely random control sites are generated in silico and used in the analysis described here.
In the heatmaps, colored tiles indicate the intensity and direction of any departures from the distributions of random controls for each genomic feature in each integration site dataset. Three random sites are picked per integration site. The locations are then annotated using the hiAnnotator R package.
The coincidence of genomic feature “J” with each integration site and random control site is measured. The nonparametric method of estimating receiver operating characteristic (ROC) curve areas and their covariance structure of DeLong et al.45 is used. Each integration site is compared in a pairwise fashion with random control sites, and a number is assigned indicating the relative rank of the integration site: 1 if the measurement of J is higher at the integration site than at a random control site, 0 if the measurement of J is lower at the integration site than at a random control site, and 0.5 if the measurement of J is equal for the two sites. All such values are calculated for a dataset of integration sites and averaged to obtain the overall ROC area for the feature measured (https://github.com/BushmanLab/hotROCs). This is equivalent to comparing the ranks of the sites with those of the controls. In older datasets with integration sites recovered by cleavage with restriction enzymes, matched random controls based on proximity to restriction enzyme cleavage sites were used. In that setting ROC curve areas were based on comparing each site only with its matched controls.35
An ROC area between 0 and 0.5 indicates the genomic feature occurs less frequently at or near integration sites than at or near random sites in the genome and is therefore disfavored. An ROC area between 0.5 and 1 indicates the genomic feature is enriched at integration sites. An ROC area of exactly 0.5 indicates that integration sites in the dataset are neither enriched nor depleted with respect to the feature of interest. The ROC area is converted to a color tile according to the colorimetric scale shown at the bottom of the heatmap.35 In Figure 2, positive associations (enrichment compared with random) are shown as increasingly intense shades of blue, negative associations (depletion compared with random) as increasing intense shades of yellow, and no difference from random as black. Each tile represents a comparison of integration sites with the randomly sampled controls for one genomic feature (row) in one experimental dataset (column).
Note that we do not present the magnitude of effect in terms of the original units of measurement. We simply ask whether the average integration site has a higher rank for a given type of feature than its matched random control sites. The color indicates the average quantile of each integration site relative to its random controls. This removes skewing effects contributed by non-normal distributions of the data and also reduces the effect of data points with extreme values for a feature. Statistical tests are carried out to determine whether the ROC areas calculated are significantly different from one another or from 0.5 (indistinguishable from random controls; methods are further explained in Supplemental Materials and Methods).35, 46
The heatmap shown in Figure 2 compares the integration site distribution at two time points from a gene-corrected patient (patient 1 [P1]) with epigenetic marks mapped in CD133+ progenitor cells. Patient 1 (P1) was treated with an early gammaretroviral vector, used to deliver the missing IL2RG gene to treat SCID-X1.47, 48 The two samples were isolated from peripheral blood mononuclear cells (PBMCs) taken at 177 and 189.5 months after gene therapy. For comparison, another sample using a lentiviral-vector-infected human-derived HAP-1 cell line has been included to illustrate differences with the lentiviral integration pattern. Quadruplicate assays for each DNA sample are shown to illustrate reproducibility.
The distributions of integration sites datasets (Table S1) were compared with the distributions of 10 different epigenetic marks or bound DNA binding proteins.49 Each of these was mapped by chromatin immunoprecipitation sequencing (ChIP-seq), in which each protein was covalently cross-linked to DNA, and bound DNA fragments were recovered by immunoprecipitation, sequenced, and then mapped to the human genome to identify relative density. Densities of mapped ChIP-seq annotations were compared with distributions of integration sites within 10 kb windows, and the collection of values was used to generate ROC areas.
For the gene therapy specimens made by infection of stem cells with a gammaretroviral vector (patient 1 [P1]), the distribution mostly favored marks associated with active transcription (H3K9me1, H3K4me1, H4K20me1, and H3K27me1). Integration was disfavored near marks associated with repressive chromatin (H3K27me3 and H3K9me3). However, the gammaretroviruses favor integration near transcription start sites,10 and H2AZ and H3K4me3 were positively associated, and RNA polymerase II (Pol II) more strongly than for lentiviruses.
For the lentiviral infection in HAP1 cells,50 integration is favored near sites of covalent modifications of histones associated with active transcription, including H3K9me1, H3K4me1, H4K20me1, H3K27me1, and H3K6me3. Lentiviral integration is favored within transcription units,1, 3, 4 where the H3K36me3 mark is found, so integration was favored near this mark for lentiviral infection, but not gammaretroviral infection. Repressive chromatin marks were disfavored, including H3K27me3 and H3K9me3. Marks found near transcription start sites (H3K4me3 and H2AZ) were either slightly disfavored or neither favored nor disfavored.51 These patterns parallel those seen previously for lentiviral vectors in diverse cell types.4, 8, 11, 52, 53
An added feature of these heatmaps is that they have been engineered to allow interactive statistical tests (Figure 3; interactive heatmaps are available in Supplemental Information and Data S1). Heatmaps are generated as scalable vector graphics (SVG), which can be opened in an Internet browser. Users can click on a row or column, and statistical results appear on the heatmap tiles documenting whether results in other rows or columns differ from the query. Users can also click on a button to the right of the maps to allow comparison of all tiles with the random control. Results of statistical comparisons are reported as asterisks on each tile. Some examples are shown in Figure 3, illustrating comparisons among random (Figure 3A), the leftmost HAP1 dataset (Figure 3B), or the Pol II ChIP-seq distribution (Figure 3C).
Figure 4 presents another form of the heatmap that queries the results of multiple additional features, including mapped DNase I cleavage site (which reports DNA accessibility), CpG islands (important in gene regulation), guanine/cytosine (GC) percentage, gene counts as documented in the refSeq dataset, and proximity to gene boundaries. For those features that are mapped in intervals (GC percentage over 1 Mb, 100 kb, 10 kb, and so on), it is often unknown a priori what width is the most relevant to the biological question at hand. Thus, for these features, results for a number of different interval sizes are shown.
All three datasets are compared over their four replicates against these features (Figure 4). High densities of DNase I hypersensitive sites and high densities of CpG islands are associated with favored integration for both lentiviral and gammaretroviral vectors (red coloration). High GC content is also favored, likely because high GC content is characteristic of gene-rich regions, although for lentiviruses, the preference switches to local high adenine/thymine (AT), possibly associated with binding of LEDGF/p75, the tethering cofactor, or wrapping of integration target site DNA on nucleosomes.11 Paralleling the favored high GC content, direct measures of gene richness are also positively associated. Integration is disfavored relative to regions with long gene widths or long intergenic distances, because these are indicative of gene-sparse regions. The gammaretroviral vector sites are disfavored relative to long gene boundary distances because they favor integration near transcription start sites. Integration is favored within genes (as annotated by the refSeq dataset) for lentiviral vectors,3 but only weakly favored for gammaretroviral vectors.
Thus, numerous relationships between integration site datasets and genomic features can be explored statistically using these interactive heatmaps.
A question of interest in many therapeutic applications centers on whether integration sites accumulate near the transcription start sites of cancer-related genes. A complication is that there are many ways of defining cancer-associated genes, and most such genes are important only in specific types of human cancers. For annotating gene therapy results, we have thus generated multiple lists of cancer-associated genes that can be queried as appropriate for integration site analysis (http://www.bushmanlab.org/links/genelists).
In one approach, we created a maximally comprehensive list (AllOnco) for use in first-pass screening based on the idea that we cannot predict what cancer-associated genes are most important in the novel clinical setting of human gene modification. The list incorporates known human cancer genes and human homologs of cancer genes in model organisms, and so includes 2,125 total genes, or roughly 8.5% of all human genes (assuming 25,000 total). Comparison with oncogene annotation is summarized using the heatmap format (Figure 4, bottom row), which scores the frequency of integration sites within 100 kb of cancer gene transcription start sites in integration sites versus random sites.
An important application of integration site analysis is tracking outcome in human gene therapy. For this we have developed a standardized patient report template that rests on top of the INSPIIRED pipeline. Use of a reproducible report format allows tracking of datasets used in each study and version control of code (which are specified by dates). The report software takes in integration site and break point positional information, annotates the sites using hiAnnotator, and outputs targeted analyses of integration site distributions. An example of a patient report is provided in Data S2, showing two recent time points monitored for patient 1 (P1) treated for SCID-X1. Earlier time points were analyzed by 454 Roche pyrosequencing and were previously reported.25
Some excerpts from the report are presented in Figure 5. The software generates a summary table (Figure 5A) that reports the patient, time point, cell type, patient metadata, and summary statistics for each sample. Among these are the total numbers of reads, the number of cells inferred to have been sampled (sum of break points captured), and the number of unique integration sites after dereplication. Four statistics summarizing population structure are also calculated for each sample. The minimum population size is inferred from a Chao1 analysis with jackknife correction, which takes advantage of the four replicate analyses typically run for each sample.34 Skewing in proportional abundance is calculated using the Gini index, where 0 indicates a perfectly even distribution of integration sites over the cells sampled and increases up to 1 with increasing oligoclonality. Diversity is calculated using the Shannon index, which summarizes both the number of different unique integration sites and the evenness of distribution of cells sampled (SonicAbundance) among integration sites.
Here, we introduce a new metric, called the UC50 (unique cell progenitors contributing the most to the expanded 50% of progeny cell clones). To generate the number, progenitor cell clones (reported as unique integration sites) are first ranked by the relative abundance of progeny cells using SonicAbundance (reported by linker ligation site data). The UC50 then reports the number of unique clones (integration sites) responsible for making up the top 50% of all cells sampled. Thus, if a single clone comprises more than 50% of the sample, the UC50 value will be 1. In contrast, for efficient lentiviral infections of cells in short-term tissue culture, the UC50 values can be in the thousands (data not shown).
Finally, where available, the vector copy number per cell (VCN), determined separately by qPCR, is added to allow assessment of the efficiency of gene marking in the cell population.
The relative abundance of integration sites in or near specific genes is summarized in several ways (Data S2). These include two types of stacked bar graphs (an example is shown in Figure 5B; both are displayed in Data S2). Bar graphs either display the number of cells observed (sonic breaks) that are associated with integration sites in the samples (scaled to the most abundant sample) or the proportional abundance of integration sites in each sample, where every sample is scaled to 100% (Figure 5B). Each high-abundance integration site is named by the closest gene, which is color coded in the key for each graph. Genes are annotated by whether the site is within a transcription unit (*), is within 50 kb of a cancer-related gene (~), or is associated with a gene strongly associated with human lymphoma (!). This last list consists of 38 human genes commonly involved in lymphoid cancers and includes most genes previously implicated in adverse events in human stem cell gene therapy, including LMO2, MDS/EVI1, and CCND2.29, 30, 31
Heatmaps provide another type of visualization (Data S2). These have the advantage over stacked bar graphs in that each integration site above a chosen abundance is given equal space, whereas in stacked bar graphs, rarer sites can be shown as thin bands that may be difficult to visualize. A third visualization is provided by line graphs (Data S2), which highlight the behavior of the most abundant clones over time.
Another set of figures queries integration sites near genes of concern for adverse events in human gene therapy, including LMO2, CCND2, HMGA2, and MECOM.29, 37 In these visualizations, the distance from the transcription start site is shown on the x axis, proportional abundance is shown on the y axis, and the time point is color coded. By this means it can be seen that an integration site near CCND2 achieves >10% abundance near the CCND2 transcription start site (Figure 5C).
Handling integration sites that map to multiple locations in the human genome presents a particular challenge. It could be that an integration site authentically resides in a repeated sequence that is also in the 5′ region of a cancer-associated gene and potentially marking an adverse event. To accommodate this, reports include an account of the SonicAbundance of cells with integration sites in repeated sequences,34 allowing tracking of possible sites of concern in multihits.
The reports end with word bubbles (Figure 5D), which provide a visual summary of genes near integration sites in expanded clones.27 Names of genes that are nearest to the integration sites are used to construct the word bubble. Word size is scaled by the SonicAbundance measure for each integration site, and each gene name is marked with the same integration flags (*, ~, and !) used in earlier plots. By this means, the major clones in the sample are evident at a glance.
Figure 5 shows excerpts from a reproducible report summarizing monitoring of patient 1 (P1) from the first trial to treat SCID-X1. Results are summarized for PBMCs from two time points, 177 and 189.5 months after gene therapy. Half a million to a million reads were collected for each sample, allowing investigation of 13,000 cells associated with about 1,000 integration sites. UC50 values for the two time points are 10 and 8, indicating the presence of expanded cell clones.
In early studies based on 454/Roche pyrosequencing, the subject was found to have an expanded clone with an integration site near CCND2 (6% of all reads), a gene for which a nearby integration event was associated with an adverse event in another SCID-X1 gene-corrected patient.25, 29 Thus, it was of interest to monitor the behavior of the clone in this patient over time. Analyses using Illumina paired-end sequencing are summarized in Figure 5, which shows that the CCND2 clone has slightly expanded in abundance (nonparametric comparison of replicate medians yields p = 0.029 when compared by relative abundances and p = 0.057 when compared by absolute abundances judged by SonicAbundance). The integration site is 3,241 nt upstream of the CCND2 transcriptional start site. Thus, longitudinal tracking reveals a stable expanded clone in this subject.
Here, we describe a collection of tools for the analysis and visualization of integration site distributions. This tool set takes advantage of the INSPIIRED pipeline described in the accompanying paper.9 Integration sites are mapped to the human genome, the abundances of their host cells tabulated using the SonicAbundance method, and results stored in a database, allowing flexible downstream analysis. Integration site distributions can then be compared with genomic features and sites of epigenetic modification as ROC areas. These data are summarized as interactive heatmaps, allowing comparison with random distributions, other integration site datasets, or genomic annotation by simply clicking on a row or column, which outputs comprehensive statistical tests.
For applications to human gene therapy, interest often focuses on longitudinal behavior of expanded clones and proximity of integration sites to cancer-associated genes. A standardized report format was developed, allowing interactive comparison among patient datasets and querying multiple aspects of longitudinal behavior. For this, we introduce the UC50 metric, which is generated by ranking progenitors (integration sites) from most to fewest daughter cells produced (linker ligation sites) and counting the number of progenitors contributing to the top 50% of the distribution. Thus, clonal expansion yields low UC50 numbers and highly polyclonal samples high UC50 numbers. These tools were used to query recent clonal behavior in a patient from the first SCID-X1 gene therapy trial.29, 47, 48 The patient studied has an expanded clone with an integration site near the proto-oncogene CCND2. The analysis of integration sites from month 177 to month 189.5 post-treatment revealed stability of this clone, with possible slow expansion. This analysis illustrates how the tools described here can be applied to monitor outcomes in gene therapy.
As in Cavazzana-Calvo et al.47 and Hacein-Bey-Abina et al.,48 patient 1 (P1) fulfilled the eligibility requirements for first ex vivo γc gene therapy trial (1999–2002) at age 11 months. P1 was diagnosed with SCID-X1 based on his blood lymphocyte phenotype, revealing a tail-less γc receptor expressed at the membrane (R289 X). Marrow was harvested and subjected to CD34+ cell separation, obtaining 9.8 × 106 CD34+ cells per kilogram of body weight. Harvested cells were then exposed to MFG γc vector-containing supernatant daily for 3 days. P1 was then infused with the treated CD34+ cells (19 × 106 cells/kg) without prior chemoablation.
As explained in the companion paper,9 integration sites are identified by sequencing the LTR-host junctions from genomic DNA after linker-mediated PCR amplification. Genomic DNA is randomly sheared by ultrasonication, after which linkers are ligated to the repaired DNA for amplification. Nested PCR is used to amplify the LTR-host DNA junctions by priming from the viral LTR and the linkers, appending the sequences needed for sequencing. Samples are sequenced using the Illumina paired-end platform, and the output sequencing files are processed by intSiteCaller to yield integration site positions on a host draft genome. Integration site data and ChIP-seq data were mapped onto the hg18 genome draft, to match the original draft genome used for analysis of the ChIP-seq data. As in Berry et al.,35 receiver operating characteristic (ROC) areas are used to compare integration sites with random control sites.
INSPIIRED is distributed online as a downloadable virtual machine executable on the Windows, Mac, and Linux operating systems, as well as a GitHub source code repository supported by a Conda software environment (see https://github.com/BushmanLab/INSPIIRED, which also includes detailed instructions for use and test datasets).
E.S., C.N., C.C.B., E.S., M.C., and F.D.B. designed the study; E.S., C.N., C.C.B., E.S., M.C., F.M., M.J.D., P.B., S.H.-B.-A, L.C., and F.D.B. carried out the study; E.S., C.N., C.C.B., E.S., Y.W., A.D., N.M., A.B., K.B., J.K.E., M.C., F.M., S.R., M.J.D., P.B., S.H.-B.-A., L.C., and F.D.B. analyzed the data.
The authors declare that they have no competing interests.
We are grateful to members of the F.D.B. laboratory for help and suggestions. All authors were supported by grants AI 052845, AI 104400, AI 082020, AI 045008, AI 117950, and HL 113252 and ERC Regenerative Therapy grant 269037 from the European Research Council, and an award from the French ANRS. We also acknowledge support from the Penn Center for AIDS Research (grant P30 AI 045008) and the PennCHOP Microbiome Program.
Supplemental Information includes Supplemental Materials and Methods, one figure, one table, and two data files and can be found with this article online at http://dx.doi.org/10.1016/j.omtm.2016.11.003.