Although many vectors exhibit a preference for genes, and even specific genes, few vectors repeatedly integrate into the same precise position with any significant frequency. Rather, most genes harboring frequent insertions show a distribution of insertions into several positions within the same gene. Some vector integrases, such as those for phages
], as well as the Escherichia coli Tn7
], recognize specific DNA sequences or degenerate sequences that exist in mammalian genomes. SB integrates specifically at a TA dinucleotide, and the piggyBac transposon integrates into the sequence TTAA. Because the oncogenic potential of a vector is related to its propensity to integrate in or near a select few genes, understanding local parameters that affect integration may contribute to our ability to assess the risk associated with these vectors in gene therapy.
For retroviruses and the SB transposon, consensus sequences have been described surrounding the sites of integration [111
]. Although retroviruses do not exhibit a strong consensus sequence, the nonrandom pattern of integrations and the observation that frequently hit sites did not match the consensus sequences led investigators to examine other properties of DNA sequences surrounding target sites, including structural characteristics of the DNA itself. DNA structural characteristics are based on non-Watson-Crick interactions between nucleotides and encompass deformations of the regular double-helix structure caused by interactions between adjacent, planar bases (Figure ). Originally characterized from analysis of crystal structures of DNA bound to histones and other proteins, these characteristics include 'protein-induced DNA deformability', 'A-philicity', and trinucleotide 'bendability'. These properties underlie local variations in DNA structure that are probably relevant to recognition of DNA by transposases and integrases. Early investigations into insertion preferences showed that viruses preferred 'bent' DNA [118
], and several groups have investigated secondary DNA structural patterns in sequences that flank mapped insertion sites for both transposons [115
] and retroviruses [111
] to determine general characteristics of the sequences flanking 'preferred' integration sites. Similarly, the RAG1/2 protein complex, which has properties akin to the cut-and-paste transposases, recognizes a specific sequence/structure for recombination of antigen receptor genes [132].
Different DNA sequences may produce highly similar patterns of DNA secondary structure, and thus common structural patterns that are preferred for integration may be obscured by approaches that analyze sequence alone. Analysis of secondary structure for a DNA sequence is based on translation of a sliding window of two or three bases into structural values for each 'step'. For example, the tendency of a B-form helix to adopt the A-form (A-philicity; Figure ) can be predicted by translating each consecutive (overlapping) dinucleotide into one of 10 A-philicity values for the 16 combinations of base pair transitions [133
]. Similarly, protein-induced deformability encompasses several changes in base pair orientation from a 'perfect B-form double helix' in a transition between two consecutive base pairs (Figure ). All of these changes can be expressed as a single composite parameter of protein-induced DNA deformability known as Vstep, which represents the physical relationships of any two planar base pairs in terms of their relative shifts and angular orientation. In contrast to A-philicity and protein-induced deformability, DNA bendability is best modeled using a sliding window of three bases, with 64 possible trinucleotide bendability values [139].
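The sliding-window translation described above can be sketched in a few lines. This is a minimal illustration, not the published method: the function names are hypothetical, the lookup table holds placeholder numbers rather than real A-philicity or Vstep values, and the collapse of 16 dinucleotides into 10 classes is assumed here to follow from treating a step and its reverse complement as the same step.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_step(step):
    """Pick one representative for a step and its reverse complement."""
    return min(step, reverse_complement(step))


def sliding_steps(seq, width):
    """List every overlapping window of `width` bases along `seq`."""
    seq = seq.upper()
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def structural_profile(seq, table, width):
    """Translate a sequence into one structural value per step."""
    return [table[canonical_step(step)] for step in sliding_steps(seq, width)]


# The 16 dinucleotides fall into 10 strand-symmetric classes.
DINUCLEOTIDE_CLASSES = sorted(
    {canonical_step("".join(pair)) for pair in product("ACGT", repeat=2)}
)

# Placeholder values only (assumption): substitute a published table here.
PLACEHOLDER_TABLE = {
    step: float(index) for index, step in enumerate(DINUCLEOTIDE_CLASSES)
}

profile = structural_profile("ATTA", PLACEHOLDER_TABLE, width=2)
```

A sequence of n bases yields n - 1 dinucleotide steps, or n - 2 trinucleotide steps when `width=3` is used with a trinucleotide table.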
An example of DNA structural analysis for the Tol2 transposon is shown in Figure , in which average structural values for each position flanking an insertion site are plotted and compared with a plot of random sequences. In the case of Tol2, weak preferences in Vstep and A-philicity values at specific coordinates are apparent from the peaks in the heavy black lines in Figure (left sides), in contrast to the same averages derived from random sequences (right sides). Overall, the bendability around Tol2 insertion sites exhibits little deviation from a random sequence (Figure ), unlike those preferred by SB transposase (Figure ). Analysis of hundreds of integration sites for potential gene therapy vectors, including viruses as well as transposons, shows that many have subtle preferences for these variables (Figure ). For example, the piggyBac transposon may favor sites with slightly higher A-philicity, lower bendability, and lower Vstep values than random sequences. In contrast, 'preferred' SB insertion sites (see below) clearly display a jagged Vstep pattern and higher bendability. Interestingly, although retroviruses (avian sarcoma virus [ASV], HIV, MLV, and simian immunodeficiency virus) integrate into bent DNA [128
], such as that bound to nucleosomes, our analyses of sequences around viral insertion sites do not indicate a particular preference for bendable DNA (Figure ). A similar, more rigorous approach has been utilized to characterize Drosophila
] and non-LTR retrotransposons in Entamoeba histolytica
], demonstrating that DNA structural characteristics at insertion sites for both elements are significantly different from collections of random sequences.
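The comparison of averaged insertion-site profiles with random sequences can be illustrated as follows. This is a sketch under stated assumptions: each profile is taken to be a list of structural values already aligned on the insertion site, and the function names are hypothetical rather than drawn from any of the cited analyses.

```python
def mean_profile(profiles):
    """Average equal-length profiles position by position."""
    if not profiles:
        raise ValueError("need at least one profile")
    length = len(profiles[0])
    if any(len(profile) != length for profile in profiles):
        raise ValueError("profiles must be aligned to the same length")
    # zip(*profiles) groups the values found at each flanking position.
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def deviation_from_background(site_profiles, background_profiles):
    """Subtract the random-sequence average from the insertion-site average."""
    site_mean = mean_profile(site_profiles)
    background_mean = mean_profile(background_profiles)
    return [site - background for site, background in zip(site_mean, background_mean)]
```

Positions at which the deviation stays near zero behave like random sequence. A formal comparison would also need a measure of variance, which this sketch omits.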
Figure 3. Approaches to identification of DNA structural characteristics governing insertion site preferences for Tol2 and SB transposons. (a) Averaging of all available insertion sites smoothes trends observed in individual plots. Plot of Vstep profiles of 18 ...
Figure 4. Variability in DNA structural characteristics between insertion sites for various vectors. All (a) A-philicity, (b) trinucleotide bendability, and (c) Vstep values were summed across 12 nucleotides and averaged for all sites of each vector class. (d) ...
For SB, the observation of general structural trends surrounding insertion sites eventually led to the identification of a specific DNA structural pattern governing insertion preference. Vigdal and coworkers [124
] observed that increased DNA deformability and A-philicity were features of a consensus sequence that flanked SB TA insertion sites. Subsequently, Liu and colleagues [131
] mapped about 200 integrations into a relatively small 7 kilobase plasmid sequence and observed that some common integration sites did not share the consensus sequence. These results identified several 'preferred' TA dinucleotides that harbored recurrent integrations. These preferred integration sites exhibited a striking, specific pattern of alternating high and low deformability (Vstep
) values that were absent from TA sites that were rarely, if ever, used. This led to the conclusion that SB transposase prefers a 'zigzag' Vstep
pattern of DNA deformability [131
], which was later confirmed on a larger, genomic scale [115
]. It remains unknown whether these patterns influence the recognition and binding of the SB transposase, catalysis of the transposon integration, or some other mechanistic factor.
This analysis was repeated for other vectors, including piggyBac, P-elements, and several retroviruses [115
]. However, only weak structural signatures were detected, which were no more informative than the weak consensus sequences previously identified. A key difference in the SB screen was the level of saturation of a small target, which allowed for the identification of highly preferred sites over nonpreferred TA dinucleotides. In contrast, the datasets for the other vectors were derived from a relatively small number of insertions into mammalian genomes, which were insufficient to obtain an initial set of preferred sequences. Because nonpreferred sites are likely to vastly outnumber preferred sites in the genome for most vectors, any genome-wide screen will produce a mix of indistinguishable preferred and nonpreferred sites. For example, we have estimated that of the approximately 200,000,000 TA sites in a human genome, only about 10% fall into the preferred category [115
], although in the screen conducted by Yant and coworkers [106
] 189 of 573 (33%) genomic SB insertions were classified as preferred sites. Analysis of the bendability of all SB sites mapped in the screen reported by Yant and coworkers shows a peak at the center of the insertion site that is defined by the central TA dinucleotide. However, when only the preferred sites are analyzed, the surrounding nucleotides exhibit a much greater level of bendability (Figure ). This effect occurs even though the preferred sites were identified on the basis of protein-induced deformability (Vstep), which is distinct from DNA bendability. The lesson from these studies is that most genome-wide datasets (particularly from experiments involving some form of genetic selection) will probably show a similar dilution of preferred sites by greater numbers of nonpreferred sites.
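The figures quoted above, and the dilution effect they imply, can be checked with a short calculation. The counts come from the text. The weighted-mean model of dilution and the signal values passed to it are illustrative assumptions, not measurements.

```python
TA_SITES_IN_GENOME = 200_000_000      # approximate figure quoted in the text
PREFERRED_PERCENT_IN_GENOME = 10      # estimated share of preferred TA sites
PREFERRED_IN_SCREEN = 189             # preferred insertions in the screen
TOTAL_IN_SCREEN = 573                 # all genomic SB insertions in the screen

# Estimated number of preferred TA sites genome-wide.
preferred_sites_in_genome = TA_SITES_IN_GENOME * PREFERRED_PERCENT_IN_GENOME // 100

# Share of screen insertions that landed in preferred sites.
preferred_fraction_in_screen = PREFERRED_IN_SCREEN / TOTAL_IN_SCREEN

# Enrichment of preferred sites among insertions relative to the genome.
enrichment = preferred_fraction_in_screen / (PREFERRED_PERCENT_IN_GENOME / 100)


def diluted_mean(preferred_value, nonpreferred_value, preferred_fraction):
    """Pooled average when preferred and nonpreferred sites are mixed."""
    return (preferred_fraction * preferred_value
            + (1.0 - preferred_fraction) * nonpreferred_value)
```

Under this simple model, a structural signal carried only by the preferred sites appears in the pooled average scaled by the preferred fraction, which is one way to picture why mixed datasets show weaker signatures.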
There is a caveat to the analyses discussed up to this point: they all assume that the structures around integration sites have an absolute center of reference, defined by the site into which the vector integrated. Such analyses could miss structural patterns that are not strictly position specific. For instance, an integrase may prefer a local region that is highly bendable or deformable without requiring a particular pattern (or sequence). To account for this, we have examined a parameter called 'jaggedness', which we define as the degree to which Vstep values alternate from high to low, as in the preferred 'zigzag' sites for SB. We calculated jaggedness by summing the absolute values of the differences between adjacent Vstep values across a sequence, so that a jagged/zigzag site would have a higher total value than a flat, basal site, which should have a jaggedness value close to 0. Jaggedness values for several vectors are shown in Figure . Although jaggedness values at insertion sites are similar to Vstep values for most vectors (with the possible exception of Tol2), the jaggedness patterns show a high degree of variability across genomic sequences and are somewhat independent of Vstep patterns (for instance, the c-myc gene; Figure ).
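The jaggedness calculation defined above reduces to a sum of absolute differences between neighboring values. The sketch below follows that definition. The windowed variant, the function names, and the numbers in the example are assumptions added for illustration and are not real Vstep values.

```python
def jaggedness(vstep_values):
    """Sum of absolute differences between adjacent Vstep values."""
    return sum(
        abs(right - left)
        for left, right in zip(vstep_values, vstep_values[1:])
    )


def windowed_jaggedness(vstep_values, window):
    """Jaggedness of every window of `window` consecutive steps."""
    if window < 2:
        raise ValueError("window must cover at least two steps")
    return [
        jaggedness(vstep_values[start:start + window])
        for start in range(len(vstep_values) - window + 1)
    ]


# Illustrative profiles (placeholder numbers, not measured values).
flat_site = [2.0, 2.0, 2.0, 2.0]
zigzag_site = [1.0, 5.0, 1.0, 5.0]
```

A flat profile gives 0 and an alternating profile gives a large total, matching the contrast drawn between basal and 'zigzag' sites. Scanning with `windowed_jaggedness` does not depend on a fixed center of reference, which is the point of the parameter.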
Figure 5. Insertion prediction for transposon vectors surrounding the c-myc locus on mouse chromosome 15. A 3 kilobase sequence from the mouse c-myc locus (from 61,813,400 to 61,816,400 base pairs) harboring 37 retroviral insertions submitted to the Mouse Retrovirus ...