A novel family of putative homing endonuclease genes was recently discovered during analyses of metagenomic and genomic sequence data. One such protein is encoded within a group I intron that resides in the recA gene of the Bacillus thuringiensis 0305ϕ8–36 bacteriophage. Named I-Bth0305I, the endonuclease cleaves a DNA target in the uninterrupted recA gene at a position immediately adjacent to the intron insertion site. The enzyme displays a multidomain, homodimeric architecture and footprints a DNA region of ~60bp. Its highest specificity corresponds to a 14-bp pseudopalindromic sequence that is directly centered across the DNA cleavage site. Unlike many homing endonucleases, the specificity profile of the enzyme is evenly distributed across much of its target site, such that few single base pair substitutions cause a significant decrease in cleavage activity. A crystal structure of its C-terminal domain confirms a nuclease fold that is homologous to very short patch repair (Vsr) endonucleases. The domain architecture and DNA recognition profile displayed by I-Bth0305I, which is the prototype of a homing lineage that we term the ‘EDxHD’ family, are distinct from previously characterized homing endonucleases.
Homing endonucleases are proteins that drive the dominant, non-Mendelian inheritance of their own reading frames by catalyzing a double-strand break (DSB) at specific DNA target sites in a recipient genome (1). The DSB is repaired via homologous recombination, using an allele of the target gene that contains the homing endonuclease gene (HEG) as a repair template; this copies the HEG into the site of DNA cleavage. HEGs are often embedded within self-splicing introns or inteins. The inclusion of a self-splicing genetic element as part of the mobile DNA allows invasion of highly conserved regions in crucial host genes without disrupting their essential functions. The coevolution of a homing endonuclease, its surrounding intron or intein, and the host gene results in an intricate network of genetic and physical interactions that affect the expression, specificity and invasiveness of the mobile element (2).
To succeed as mobile genetic elements, homing endonucleases must balance competing requirements for high DNA cleavage specificity (to avoid host toxicity) versus the need for reduced fidelity at various base pairs in their target site (to facilitate genetic mobility in the face of sequence drift within potential DNA target sites). Homing endonucleases and associated mobile introns and inteins that have successfully achieved this balance are encoded in genomes of bacteria, organelles of fungi and algae, single cell protists and in the bacteriophage and viruses that accompany and infect those organisms.
There are five well-characterized families of homing endonucleases, which are each classified according to their unique protein folds and distinct catalytic active sites and DNA cleavage mechanisms (1). Members of the ‘LAGLIDADG’ family, so named on the basis of their most conserved protein motif, are found in eukaryotic organellar and archaeal genomes, and are the most specific of the known homing endonucleases (3). They exist both as homodimers that are limited to recognition of palindromic and near-palindromic target sites, and as pseudosymmetric monomers (where two structurally similar domains are tethered together on a single protein chain) that can target completely asymmetric targets. Members of the ‘His-Cys box’ and the ‘PD…(D/E)-xK’ families (found in protists and in cyanobacteria, respectively) also form multimeric protein complexes that recognize symmetric target sequences (4,5). In contrast, members of the HNH and GIY-YIG families (usually found in bacteriophage) display multidomain structures (corresponding to separate DNA binding and catalytic regions) and adopt highly elongated conformations when bound to DNA (6–8). As a result, those proteins usually recognize long non-palindromic sequences with significantly reduced fidelity (9,10).
Recently, a novel type of fractured gene structure, containing separately encoded halves of self-splicing inteins that interrupt individual host genes in the same locus, was discovered during an analysis of environmental metagenomic sequence data collected by the Global Ocean Sampling (GOS) project (11). These split intein sequences are found in a diverse set of host genes that are primarily involved in DNA synthesis and repair. The inteins are themselves often interrupted either by open reading frames (ORFs) that encode members of the GIY-YIG homing endonuclease family, or by novel ORFs that do not exhibit significant sequence similarity to previously characterized homing endonuclease families. Homologs of those uncharacterized ORFs were also found associated with introns or as free-standing genes. In total, 15 members of the newly discovered gene family were described, including two within previously annotated recA genes in the NCBI sequence database.
The C-terminal region of this newly identified protein family displays limited sequence homology [typically corresponding to BLASTP (12) e-values <10⁻³] to the catalytic domain of the very short patch repair (‘Vsr’) endonucleases (enzymes that generate a 5′ nick at T:G mismatches in newly replicated DNA and thus stimulate DNA nucleotide excision repair) (13,14). Several catalytic residues from Vsr endonucleases are conserved across all members of the new gene family, and form the composite sequence motif EDxHD. These residues include an essential aspartate that coordinates a catalytic magnesium ion, a histidine believed to act as a general base and a neighboring aspartate residue. Based on the presence of a recognizable endonuclease catalytic domain within these intron- and intein-associated microbial ORFs and the conservation of catalytic residues within that domain, this gene family was therefore hypothesized to encode a novel lineage of homing endonucleases.
These ORFs also display sequence signatures in their N-terminal regions that are similar to those found in several nuclease associated modular DNA-binding motifs (‘NUMODs’) (15). NUMODs are frequently found in other homing endonucleases from bacteriophage, such as the GIY-YIG endonuclease I-TevI (8) and the HNH endonuclease I-HmuI (6). In those cases, the NUMODs are found at the C-terminal end of those proteins (a reversed domain organization compared to the metagenomic ORFs described above). The extended conformation that NUMOD regions adopt upon DNA binding dictates that they make relatively sparse contacts across their long target sites.
A representative member of this novel homing endonuclease family, which we have named I-Bth0305I, was identified in the NCBI sequence database during the same genomic analysis (11). This ORF is located within a group I intron that interrupts the RecA gene of Bacillus thuringiensis 0305ϕ8–36 bacteriophage. The experiments described in this manuscript define the binding site, cleavage pattern and specificity of I-Bth0305I, and the crystal structure of its catalytic domain. These experiments demonstrate that I-Bth0305I is a site-specific endonuclease that forms a homodimer and contacts a region of DNA up to 60bp in length. Unlike many bacteriophage homing endonucleases (which tether relatively nonspecific catalytic nuclease domains to sequence-specific DNA-binding domains, and therefore display significant specificity for DNA base pairs that are located some distance from the site of cleavage), I-Bth0305I displays its greatest specificity across the central residues of its recognition site (spanning the positions of DNA cleavage and intron insertion), and little additional sequence specificity at positions more distant from the cleavage site. The crystal structure of the I-Bth0305I catalytic domain confirms that members of this putative homing endonuclease family share a common ancestor with the Vsr mismatch repair endonuclease, and supports a similar mechanism for DNA strand cleavage.
Sequences of Vsr-like putative homing endonucleases (Supplementary Data) were identified in the NCBI sequence databases and JCVI data using BLAST sequence searches and BLIMPS motif searches as previously described (11). Multiple sequence alignments were constructed with MEME (16), MACAW (17), DIALIGN-TX (18) and GLAM-2 (19) programs.
RecA gene regions corresponding to the I-Bth0305I cleavage and intron insertion site were identified by searching complete genomes of bacteria from the NCBI with Blocks database block IPB001553D using the BLIMPS program. The identified regions and the 0305ϕ8–36 bacteriophage intron-inserted region were aligned using the SeAl program (http://tree.bio.ed.ac.uk/software/seal/) to form a multiple alignment of 1368 sequences. Sequence logos of this region and of its translated protein product were constructed as previously described (20), using a total of four characters and equal expected base frequencies for the DNA sequence logo.
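The per-position height of a DNA sequence logo built with four characters and equal expected base frequencies can be illustrated with a minimal sketch. It assumes the plain Shannon formulation (2 bits minus the column entropy) with no small-sample correction, so the procedure cited as (20) may differ in detail. The function names and the toy alignment are hypothetical and are not data from this study.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT"):
    """Information content (bits) of one alignment column, assuming equal
    expected frequencies for the four bases (maximum log2(4) = 2 bits).
    Gap and ambiguity characters are ignored."""
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy

def logo_information(aligned_seqs):
    """Per-column information content for equal-length aligned sequences."""
    return [column_information("".join(col)) for col in zip(*aligned_seqs)]

# Hypothetical toy alignment (not taken from the 1368-sequence alignment).
toy = ["TTCG", "TTAG", "TTGG", "TTTG"]
info = logo_information(toy)
```

In this toy case the invariant columns carry the full 2 bits, and the column in which all four bases occur equally often carries none.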
I-Bth0305I NUMOD conserved motifs were identified by analyzing I-Bth0305I and sequences similar to its N-terminal non-catalytic region. One such motif, typically appearing twice in each sequence, was identified. This motif was found to be significantly similar to the ‘NUMOD 2 motif’ (15) and to various DNA-binding HTH motifs from the Blocks release 14.3 database (21) [including IPB000792 (LuxR bacterial regulatory proteins), IPB000831 (Trp repressors) and IPB002197B (FIS bacterial regulatory proteins)] using the LAMA program (22). The specified blocks were used to predict the position of the HTH DNA-binding region within the NUMOD 2 motifs of I-Bth0305I.
Synthetic genes encoding I-Bth0305I and several additional homologs that were identified in an earlier metagenomic analysis (11) were ordered from Genscript (New Jersey, USA) with codons optimized for protein expression in Escherichia coli (Supplementary Figure S1). These reading frames were ligated into an in-house pET15-HE vector (Supplementary Figure S2) for initial protein trials. Subsequently, the reading frame encoding I-Bth0305I was subcloned into a pGEX-6p-3 expression vector, for production of the protein as a fusion with glutathione-S-transferase (GST). Inactivated constructs of the full-length protein were generated by mutating either the putative general base (H213A) or a putative metal-binding residue (D222A). A construct corresponding to the isolated predicted catalytic domain was generated by subcloning amino acids 167 through 266; two point mutations corresponding to D196A and H213A were introduced to allow overexpression by inactivating the construct. To facilitate crystallographic phasing, an additional point mutation (L180M, which could be expressed as a selenomethionyl residue) was introduced at a position predicted to be a surface residue on the opposite side of the protein from the bound DNA.
For initial overexpression trials of I-Bth0305I and its homologs, the pET-15HE expression vectors containing the endonuclease reading frames were transformed into BL21(DE3)RIL cells using a standard heat shock transformation protocol: add 5ng plasmid to 50µl competent cells, incubate on ice for 2min, heat shock for 30s at 42°C, incubate on ice for 2min, add 200µl SOC media, shake at 220rpm at 37°C for 20min, then plate on LB agar plates with 0.1mg/ml ampicillin. Single colonies were picked and grown in LB media with 0.1mg/ml ampicillin. Starter cultures of 3ml were grown overnight to saturation and then transferred to 1l of LB media which was incubated at 37°C at 220rpm until cells reached mid log phase (OD 0.5–1.1). Cultures were then placed on ice for 20–60min before adding IPTG to 1mM. Cells were harvested by centrifugation and examined by SDS–PAGE electrophoretic analyses (Supplementary Figure S3).
Purification of I-Bth0305I to homogeneity was then carried out using protein expressed as a GST fusion from the pGEX-6p-3 bacterial expression vector. GST-tagged I-Bth0305I was overexpressed at 16°C while shaking at 220rpm for 16–20h. The cell pellet was resuspended in 45ml of lysis buffer (50mM Tris pH 7.0, 250mM NaCl) before being sonicated on ice for 3×30s (with 1min cooling periods) in a 50ml polypropylene tube using a high-power setting with a microtip. The resulting cell lysate was centrifuged to pellet insoluble material. The supernatant was then incubated with 2ml of washed Sepharose-glutathione 4B beads (GE life sciences) using a gentle rocking motion at room temperature for 30min. Beads were collected using a gravity flow column and washed with 40ml of high salt wash buffer (50mM Tris pH 7.0, 2M NaCl). Beads were washed again with lysis buffer. Finally, 2ml of lysis buffer was added to the beads along with 80 U of PreScission protease. The mixture was incubated for 16h with a gentle rocking motion at 16°C. The resulting protein was eluted directly from the beads and purified further via heparin affinity chromatography. A 2ml sample of protein at a concentration of roughly 2mg/ml was run over a heparin column in lysis buffer. Following binding, a 40ml gradient was applied in which the NaCl concentration was increased from 250mM to 2M. I-Bth0305I eluted at ~1M NaCl and was found to be >95% pure as estimated by electrophoretic analysis.
Purified I-Bth0305I was used to digest several phage DNA samples to assess the extent of activity. Phage lambda DNA was chosen as a substrate for further testing. Aliquots containing 30µg of phage lambda DNA were digested for 1h at 37°C with a series of 2-fold dilutions of I-Bth0305I ranging in concentration from 20ng/µl (0.65µM) to 9.8pg/µl (0.6nM), as illustrated in Supplementary Figure S4. The DNA was extracted with phenol and chloroform, precipitated, and resuspended in 10mM Tris 1mM EDTA, and then diluted in water to 10ng/µl for use as template for sequencing reactions. Sequencing reactions were carried out on the respective DNA samples using 19-base oligonucleotide primers (IDT, Inc.), which were complementary to staggered positions along each DNA strand. Sequencing reactions were performed on an ABI 3730xl capillary sequencer. Output sequence traces were assembled and aligned to the reference lambda genome (Genbank file: NC_001416). Assembled sequence traces were examined by eye for signals indicative of strand cleavage, comprising a significant drop in average peak trace height following a spurious additional ‘A’ peak (in the case of forward sequencing reactions) or a spurious additional ‘T’ peak (in the case of transposed reverse sequencing reactions).
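The arithmetic behind such a 2-fold dilution series can be sketched as follows. This is only an illustration: the number of halving steps (11) is inferred from the two quoted mass concentrations rather than stated in the text, the conversion uses the predicted monomer molecular weight of 30912 Da given for I-Bth0305I, and the function names are hypothetical.

```python
MW_DA = 30912.0  # predicted monomer molecular weight of I-Bth0305I (g/mol)

def twofold_series(start_ng_per_ul, n_steps):
    """Concentrations (ng/µl) of a 2-fold serial dilution, undiluted stock first."""
    return [start_ng_per_ul / 2 ** i for i in range(n_steps + 1)]

def ng_per_ul_to_molar(conc_ng_per_ul, mw_da=MW_DA):
    """Convert ng/µl (numerically equal to mg/l) into mol/l."""
    grams_per_litre = conc_ng_per_ul * 1e-3
    return grams_per_litre / mw_da

# Assumption: 11 halvings take 20 ng/µl down to roughly 9.8 pg/µl.
series = twofold_series(20.0, 11)
lowest_pg_per_ul = series[-1] * 1000.0
top_micromolar = ng_per_ul_to_molar(series[0]) * 1e6
```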
Non-competitive cleavage digests (corresponding to experiments depicted in Figures 2a and 4) were performed using equimolar concentrations (500nM) of enzyme and linear DNA duplex substrates. The DNA substrates were generated via PCR from plasmid templates. Run-off sequencing using Taq polymerase on the digested product generated from the recA gene sequence from 0305ϕ8-36 bacteriophage identified the site of cleavage in that target site (Figure 2d and Supplementary Figure S5).
A 120-bp polymerase chain reaction product corresponding to the uninterrupted RecA gene sequence from bacteriophage Bth0305ϕ8-36, with the endonuclease cleavage site positioned at its center, was generated using either of two radiolabeled PCR primers. An amount of 0.1pmol of this radiolabeled PCR product was incubated with 20µM I-Bth0305I in binding buffer (50mM Tris pH 7.0, 60mM KCl, 1mM MgCl2, 1mM 2-mercaptoethanol, 2mg/ml Bovine serum albumin) for 5min at room temperature. Following binding, 10µl of DNAseI (Roche pharmaceuticals) was added and allowed to react for 5min at room temperature. After this incubation, reactions were quenched with 160µl of stop solution (20mM EDTA, 2mg/ml salmon sperm DNA). Phenol extraction and ethanol precipitation separated the digested PCR product from I-Bth0305I and BSA in the reaction. Resulting samples were loaded on a 6% polyacrylamide DNA sequencing gel at 1700V for 1h 50min.
Aliquots of a DNA duplex corresponding to a 67-bp region of the 0305ϕ bacteriophage RecA gene sequence, centered around the endonuclease cleavage site, were injected into I-Bth0305I (300µl, 20µM) (Supplementary Figure S6). Prior to analysis, both samples were dialyzed into identical buffers corresponding to 20mM HEPES pH 7.6, 50mM NaCl, 10mM CaCl2. The reference cell temperature was kept constant at 30°C with a stirring speed of 1000rpm. In total, there were 16 injections, with the first injection being half the volume and duration of the remaining injections (2.5µl over 5.0s, 180s between each injection). The binding analyses were performed in triplicate.
A construct corresponding to a catalytically inactivated nuclease domain (residues 167–266, containing active site point mutations D196A and H213A) was overexpressed and purified in a manner similar to full-length I-Bth0305I, except that the heparin purification step was omitted. Crystals of this construct were grown in 3–4 days via the hanging drop method against a reservoir containing 100mM LiSO4, 100mM Tris pH 7.4–8.4 and 27–30% (w/v) PEG 4000. Crystals of native protein and of selenomethionyl-derivatized protein grew under similar conditions, and both were transferred into a cryoprotectant solution (100mM LiSO4, 100mM Tris pH 8.5, 30% PEG 4000, 20% sucrose) and then flash frozen in liquid nitrogen. Data collection was performed at Beamline 5.0.2 at the Advanced Light Source (ALS) synchrotron facility at Lawrence Berkeley National Laboratory (Berkeley, CA, USA). Data integration and scaling were performed using the program HKL2000 and all subsequent analysis was performed using the PHENIX crystallography suite. A single selenomethionine data set was used to solve phases, generate an electron density map, and build a molecular model of the nuclease domain. This model was then used to solve phases for the native data set via molecular replacement, and the final structure was built and refined to 2.2 Å resolution. The native data set was used for final refinement, even though it was of slightly lower resolution (2.2 Å versus 2.15 Å), because the merging statistics for that data set were otherwise superior to those of the Se-Met data (Table 1).
Genes encoding several individual representatives of the Vsr-like endonuclease gene family identified in the metagenomic analyses (11), as well as the protein we have named I-Bth0305I, were each synthesized as codon-optimized reading frames for bacterial expression in E. coli and then subcloned into a modified pET (Novagen, Inc) vector that incorporates an N-terminal, 6-histidine affinity purification tag that can be removed by proteolytic digestion with thrombin (Supplementary Figures S1 and S2). The resulting constructs displayed a wide range of behaviors during bacterial overexpression and purification (Supplementary Figure S3). Of the seven protein constructs tested, four were observed to form insoluble inclusion bodies regardless of induction conditions. Out of the remaining ORFs, the construct corresponding to I-Bth0305I significantly reduced the growth rate of the bacterial culture after IPTG induction and was observed in the soluble fraction of lysed cells. This construct was subsequently recloned into a GST-fusion expression vector (pGEX-6P-3) in the hope that the larger affinity partner might reduce DNA binding or cleavage activity during expression, allowing improved growth and recovery of expressed protein. The resulting fusion protein was soluble, easily recovered from clarified cell lysate, and could be subsequently purified using affinity chromatography and liberated from its GST fusion partner via proteolytic digestion as described in the ‘Materials and Methods’ section. The yield of this protein was ~1.5mg/l of culture, and the resulting protein could be concentrated to at least 9mg/ml in a storage buffer corresponding to 250mM NaCl, 50mM Tris pH 7.0, 5% (v/v) glycerol.
The I-Bth0305I reading frame encodes a protein that is 266 amino acids in length, corresponding to a predicted molecular weight of 30912 Da. The surrounding group I intron within the bacteriophage 0305ϕ8–36 RecA gene is 801nt in length; the start codon for the putative endonuclease reading frame is found 88nt from the start of the intron. The protein ORF interrupts the P5 element in the canonical representation of the group I intron's secondary and tertiary structure (23). As described in the original analysis of this protein family, I-Bth0305I displays an N-terminal region with two copies of sequences corresponding to NUMOD 2 DNA-binding motifs (15), and a C-terminal region that shares homology with the catalytic domain of the Vsr DNA mismatch repair endonuclease (14). Further analysis, using homologs of the I-Bth0305I N-terminal region, indicated that the two NUMOD regions might span a putative helix–turn–helix (HTH) sequence-specific DNA-binding region motif. Using the conserved sequence regions of the Vsr-like endonuclease proteins (11) we identified additional members of this family including bacteriophage Hef type homing endonucleases (24) and a bacterial protein from Corynebacterium glutamicum ATCC 13032 (Supplementary Data). These sequences allowed us to extend and refine the conserved sequence regions of the Vsr-like endonuclease family, including the identification of a fifth putative active site residue (Figure 1).
These sequence relationships were exploited at several points in this study to generate truncated expression constructs corresponding to isolated structural regions of the protein, and to design catalytically inactivating point mutations in the catalytic domain. These constructs were subcloned into the same bacterial expression vector described above, and purified as described in the ‘Materials and Methods’ section. The overall yields of the isolated N- and C-terminal regions of I-Bth0305I were ~1 and 3mg/l, respectively.
We next tested the ability of full length, wild-type I-Bth0305I to cleave a DNA substrate corresponding to the intron-minus allele of the RecA gene, and compared that cleavage activity with substrates containing DNA sequences that correspond to an ‘intron-plus’ recA allele. This experimental design was based on the known genetic propagation mechanism of most homing endonucleases, which cleave a target site within an intron- or intein-minus allele of their host gene, but usually do not cleave the same allele when it contains the inserted intervening sequence (25). In our experiments, efficient cleavage of the DNA substrate corresponding to the uninterrupted RecA gene was observed (Figure 2a). Substrates containing the intron–exon junction sequences of the bacteriophage recA gene were not cleaved by the enzyme under any conditions (Figure 2b), indicating that the enzyme only cleaves the uninterrupted recA allele prior to intron insertion.
In order to further define the actual target site and cleavage pattern exhibited by the endonuclease, as well as to establish the overall specificity of the enzyme, two separate experiments were conducted. In the first, lambda phage DNA (a 48.5-kb double-stranded DNA construct of known sequence) was used as a substrate in a series of digests with variable concentrations of purified endonuclease. All resulting product fragments were identified and sequenced using a comprehensive set of oligonucleotide primers that cover the entire length of both DNA strands. An alignment of the nicked and cleaved DNA sequences produced in this experiment identified the target site preference for the enzyme. In the second experiment, a 500-bp substrate corresponding to the recA sequence from the 0305ϕ bacteriophage was digested to completion, and both product strands were subjected to run-off sequencing using Taq polymerase. When analyzed together, these two experiments produced an unambiguous assignment of the enzyme's target site preference and cleavage activity.
Digestion of lambda DNA generated a list of target sites that were hydrolyzed by the endonuclease (Supplementary Figure S4). Alignment of these genomic sequences resulted in a target site consensus corresponding to 5′-T-T-x-G-x6-C-x-A-A-3′ (Figure 2c). This 14-bp target site displays pseudopalindromic symmetry, with the ‘TTxG’ sequence in the left half-site complementary to the ‘CxAA’ sequence in the right half-site. The majority of the target sites in these assays were nicked on either the top or bottom strand (at positions that considered together would correspond to a two base, 5′ overhang). One site that displayed a sequence that was particularly close to the consensus described above (differing at only 1bp out of 6) was cleaved on both strands and thereby produced the actual two base, 5′ overhang and cleavage pattern.
Direct run-off sequencing of the product strands produced from digests with the actual RecA-coding sequence as a substrate resulted in identification of a target site (5′-TTcGgtgatcCaAA-3′) and cleavage pattern that agree precisely with the results described above (Figure 2d and Supplementary Figure S5). Therefore, it appears that the enzyme cleaves a partially symmetric DNA target site located immediately upstream of the intron insertion site in the recA target and requires conservation of most of the ‘TTxG’ consensus target sequence in both DNA half-sites in order to generate a DSB. When limiting our analysis of the lambda DNA cleavage products to only those targets that were most efficiently nicked or cleaved (at least 90% digestion of either strand), the resulting information content and logo plot across the central 6bp was observed to agree more closely with the recA target site sequence.
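The relationship between the consensus, its pseudopalindromic symmetry and the recA target can be checked with a short sketch. The consensus and the recA site are taken from the text; the function names, the regular-expression encoding of the consensus and the 0-based indexing are illustrative choices, not the procedure used in the study.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

# 5'-T-T-x-G-x6-C-x-A-A-3' written as a regular expression (x = any base).
CONSENSUS = r"TT[ACGT]G[ACGT]{6}C[ACGT]AA"

def find_sites(dna):
    """0-based start indices of consensus matches on the given strand.
    A lookahead is used so that overlapping matches are also reported."""
    pattern = re.compile(r"(?=(" + CONSENSUS + r"))")
    return [m.start() for m in pattern.finditer(dna.upper())]

def symmetric_positions(site):
    """0-based left half-site indices whose base is complementary to the
    mirrored position in the right half-site."""
    site = site.upper()
    rc = reverse_complement(site)
    return [i for i in range(len(site) // 2) if site[i] == rc[i]]

recA_site = "TTcGgtgatcCaAA"  # target site reported for the recA substrate
matches_consensus = re.fullmatch(CONSENSUS, recA_site.upper()) is not None
symmetric = symmetric_positions(recA_site)
```

Because the consensus is its own reverse complement, scanning one strand is sufficient to find sites on both strands.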
After establishing the cleavage site in the RecA host gene, we next determined the DNAse I footprint of the enzyme bound to its DNA target (Figure 3). A catalytically inactive variant of I-Bth0305I (D222N, containing a mutation of a putative catalytic aspartate residue that was observed to prevent cleavage activity) was incubated with a 120-bp probe that corresponded to the RecA-coding sequence. The region of the complementary strand that was protected by the bound enzyme from DNAse I digestion was determined in a separate experiment. In both cases, a region of ~60nt, corresponding to 30bp that extend from each side of the center of the cleavage site, was protected from DNAse I cleavage. Subsequently, the binding of I-Bth0305I to a synthetic DNA duplex corresponding to this target site sequence was evaluated using multiple independent isothermal titration calorimetry experiments and determined to correspond to an exothermic binding reaction with a dissociation constant (KD) of 24±6nM (Supplementary Figure S6).
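For orientation, a dissociation constant can be converted into a standard binding free energy with the textbook relation ΔG° = RT ln(KD). The sketch below applies it to the reported KD, assuming a 1 M standard state and the 30°C cell temperature given for the calorimetry experiments; the resulting value is a derived illustration, not a figure reported by the authors, and the function name is hypothetical.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)

def binding_free_energy(kd_molar, temp_celsius):
    """Standard free energy of binding, dG = R*T*ln(KD), in kJ/mol.
    KD is taken relative to a 1 M standard state."""
    temp_kelvin = temp_celsius + 273.15
    return R * temp_kelvin * math.log(kd_molar) / 1000.0

# KD of 24 ± 6 nM at an assumed 30 °C.
dg = binding_free_energy(24e-9, 30.0)        # central estimate
dg_tight = binding_free_energy(18e-9, 30.0)  # lower bound of KD
dg_weak = binding_free_energy(30e-9, 30.0)   # upper bound of KD
```

Under these assumptions the central estimate is close to −44 kJ/mol, and the ±6nM uncertainty shifts it by well under 1 kJ/mol in either direction.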
Having determined the extent of DNA backbone protection corresponding to the bound endonuclease footprint and the affinity of the binding interaction, we then further assessed the sequence specificity displayed by the endonuclease in a series of digests using variants of the wild-type DNA substrate (Figure 4). These experiments indicated that the enzyme exhibits the highest specificity across the central 14bp of its target site. A series of substrates that contained either three consecutive transverted base pairs (e.g. 5′-ATC-3′ → 5′-TAG-3′) or that contained a series of AA insertions were used as substrates in parallel assays. In these experiments, cleavage activity was reduced most significantly when the DNA sequence that immediately spans the central site of catalysis was mutated. Similar perturbations introduced on either side of this central target site region were well tolerated by the enzyme.
In competitive cleavage digest experiments (corresponding to Figures 2b, 5 and 6), up to four different substrates, each at 3.5nM concentration, were simultaneously digested with 70nM of I-Bth0305I for 30min at 37°C. The substrates were of length 2200, 1900, 1600 or 1300bp and each contained a putative target site exactly at the center of the DNA construct. All digests were assayed using 1.2% agarose gel electrophoresis and relative substrate and product concentrations were quantitated using the ImageJ program. All digests were performed in 50mM Tris pH 7.6, 50mM NaCl and 1mM MgCl2.
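The design of this competitive assay can be made concrete with a small sketch that computes the expected product sizes and the enzyme-to-substrate ratio. It assumes that all four substrates are present together and that cleavage occurs exactly at the midpoint of each construct, ignoring the two-base stagger between strands; the variable and function names are hypothetical.

```python
SUBSTRATE_LENGTHS_BP = [2200, 1900, 1600, 1300]
SUBSTRATE_NM = 3.5   # concentration of each substrate
ENZYME_NM = 70.0     # concentration of I-Bth0305I

def centered_cleavage_products(length_bp):
    """Approximate product sizes (bp) for a cut at the center of a substrate."""
    half = length_bp // 2
    return (half, length_bp - half)

# Expected product sizes for each substrate.
products = {n: centered_cleavage_products(n) for n in SUBSTRATE_LENGTHS_BP}

# Molar excess of enzyme over total substrate when all four are mixed.
total_substrate_nm = SUBSTRATE_NM * len(SUBSTRATE_LENGTHS_BP)
enzyme_excess = ENZYME_NM / total_substrate_nm

# Every product size differs from every substrate size, so bands from
# different substrates can be told apart on a single gel lane.
product_sizes = {size for pair in products.values() for size in pair}
bands_are_distinct = product_sizes.isdisjoint(SUBSTRATE_LENGTHS_BP)
```

With these numbers each substrate yields a single product band at half its length (1100, 950, 800 and 650bp), and the enzyme is in 5-fold molar excess over the combined substrates.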
Alteration of the DNA sequence at the more distant 5′- and 3′-ends of the I-Bth0305I contact region (i.e. at each end of the target site previously established by DNAse I footprinting) had a much less significant effect on DNA cleavage (Figure 5). In these experiments, a series of long DNA duplex substrates (each of which was 1–2kb in length) that contained targets with gradually decreasing regions of the RecA target sequence were assayed in parallel, competitive cleavage digest experiments. Reduction of the length of the RecA gene sequence within these long substrates from a 64-bp region (corresponding to the extreme limits of the protected region observed in DNAse I footprinting assays) to 54bp caused little or no loss of cleavage activity. In contrast, a slight reduction in activity was observed when a 33-bp RecA target sequence was present, and a more significant reduction in activity was observed when the RecA target sequence was reduced to only 23bp. In no case, however, was the loss of cleavage activity in these experiments as pronounced as when as few as 3bp in the center of the target site were mutated.
Having established by a variety of methods that sequence specificity of DNA cleavage is highest across the central base pairs of its target site, we next generated a matrix of point mutations of the RecA target site (corresponding to each of the three possible single base pair substitutions at each of the central positions) and tested each for its relative ‘cleavability’ using in vitro digests (Figure 6). Although the experiments described above demonstrated that simultaneous mutation of as few as three consecutive base pairs was sufficient to significantly impair cleavage, mutation of individual base pairs had relatively little effect on cleavage under the same reaction conditions. Only three individual nucleotide substitutions in the recA target site (at positions −1, −2 and −5 in the left half-site) showed any measurable effect on cleavage efficiency. These three base pairs correspond to positions in that half-site that are not symmetrically conserved with their counterparts in the right half-site.
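A substitution matrix of this kind is straightforward to enumerate. The sketch below lists every single base pair substitution of the 14-bp recA target sequence given earlier; the assumption that the matrix spans exactly these 14 positions is ours, since the text specifies only "the central positions", and the 0-based indices used here are not the −/+ numbering of the figure. The function name is hypothetical.

```python
BASES = "ACGT"

def single_substitutions(site):
    """Map (0-based position, new base) to the variant site carrying that
    single substitution; three variants are produced per position."""
    site = site.upper()
    variants = {}
    for i, wild_type in enumerate(site):
        for base in BASES:
            if base != wild_type:
                variants[(i, base)] = site[:i] + base + site[i + 1:]
    return variants

recA_site = "TTCGGTGATCCAAA"  # 14-bp recA target site from the text
matrix = single_substitutions(recA_site)

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))
```

For a 14-bp site this gives 42 variant substrates, each differing from the wild-type sequence at exactly one position.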
Therefore, while the sequence specificity of the cleavage reaction is clearly most significant across the central 14-bp positions of the I-Bth0305I target site, the overall information content across this region (as measured by the reduction in cleavage activity caused by individual base pair substitutions) is very evenly distributed as compared to many other homing endonucleases that have been characterized (26–29), such that only multiple simultaneous base pair substitutions result in a significant loss of cleavage efficiency.
Size exclusion chromatography experiments showed that the apparent masses of both the full-length enzyme (containing a catalytically inactivating D222N mutation) and of the isolated catalytic domain (containing a D196A mutation) were approximately twice the values that were predicted based solely on the length of their protein chains (62kDa versus 31kDa for the full-length protein, and 18kDa versus 12kDa for the catalytic domain) (Supplementary Figure S7). This result was confirmed by dynamic light scattering measurements of the catalytic domain. A different point mutant within the isolated catalytic domain (H213A, corresponding to the predicted location of the active site general base) gave a reduction in apparent mass and dynamic radius of ~50%. These results indicate that the full-length endonuclease and its isolated catalytic domain form stable dimers in solution and that the dimerization interface is disrupted by mutation of His 213. This result agrees with the independent observation, described above, that the sequence in the recA gene that immediately surrounds the cleavage site displays significant pseudopalindromic symmetry (Figure 2). The presumed role of H213 in catalysis [based on prior mutational studies and conservation of the comparable residue in the Vsr repair endonuclease (13,14)], together with its observed importance for dimerization of I-Bth0305I, may indicate that dimerization and catalytic activity of the homing endonuclease are structurally linked, with that particular residue playing an important role in both properties. Structural studies of the isolated nuclease domain with the H213A mutation (described below) demonstrate that the A213 residue is significantly displaced from its position in the Vsr active site.
The crystal structure of a catalytically inactive double-point mutant (D196A/H213A) of the C-terminal region of I-Bth0305I (containing residues 167–266, which displays sequence homology to the Vsr mismatch repair endonuclease) was determined and refined to 2.2 Å resolution (PDB ID: 3R3P). Selenomethionyl-derivatized protein was used as the sole source of de novo phase information in order to avoid model bias that might arise from phase determination via molecular replacement. The final refined model (Table 1) contained residues 167–263 from the isolated catalytic domain (three residues from the C-terminus were unobserved and presumed to be disordered in the crystal). Two copies of the catalytic domain were present in the asymmetric unit; the all-atom RMSD for those two protein chains is 0.33 Å. Because the H213A mutation in this domain was shown to block dimerization (as described above), the interface between these two observed subunits is believed to represent a non-physiological interaction that is formed in the crystal lattice.
The structure of the catalytic domain consists of a central β-sheet with mixed parallel and anti-parallel topology surrounded by four α-helices. The structure of the I-Bth0305I catalytic domain superimposes against the homologous region of Vsr endonuclease (PDB ID: 1VSR) (13,14) with an RMSD of 8.76 Å across 61 atoms (Figure 7). The structure of the central β-sheet within the I-Bth0305I catalytic domain differs significantly from that of Vsr. This region within I-Bth0305I is twisted, as compared to a more saddle-shaped structure within Vsr. Furthermore, while this β-sheet contains four β-strands in both structures, only three strands are found to superimpose between the two enzymes; the two enzymes display their fourth (non-conserved) strands at opposite sides of the core β-sheet. As well, a zinc-binding sequence motif found in Vsr is missing from the loop that connects β3 and α2 in I-Bth0305I, and zinc atoms are not observed in the structure.
The α-helices that are observed in I-Bth0305I have also diverged from their corresponding structural elements in Vsr. First, the short I-Bth0305I helix α3 (residues 80–84) is instead a loop in Vsr. Furthermore, helix α2 in I-Bth0305I is considerably shorter (at 14 residues) than the corresponding 25-residue helix in Vsr (spanning residues 82–107) that is inserted into the DNA major groove in its DNA-bound structures. The differences in the structures between the two nuclease domains are critical determinants for their different functions. In Vsr, two tryptophan residues (W68 and W86) are intercalated into the DNA immediately adjacent to the T:G mismatch in that enzyme's substrate target and appear to play a key role in recognition of that particular structural lesion in the DNA. In I-Bth0305I (which instead recognizes a fully paired DNA target sequence corresponding to the vicinity of the intron insertion site) the corresponding region is instead a short flexible loop.
While the elaborations upon the core fold of the two enzymes are significantly diverged, their active site residues are closely comparable (Figure 7). Residues that superimpose very closely include Asp 196 in I-Bth0305I (which is Asp 51 in Vsr and is mutated to Ala in the crystal structure), Asp 222 (Asp 97) and Asn 208 (His 64). An additional residue in Vsr (His 69) that is thought to play a role in catalysis is conserved in the I-Bth0305I sequence (as His 213), but is located in a significantly different conformation in the two structures. In the structure of the I-Bth0305I catalytic domain, this residue is found at a surface-exposed position in the structure that is involved in crystal lattice contacts, which appears to perturb its position and rotameric conformation relative to the surrounding active site. A final acidic residue (Glu 170 in I-Bth0305I, corresponding to Glu 25 in Vsr endonuclease) might also participate in catalysis; this amino acid is well conserved but is found in an otherwise weakly conserved region (Figure 1).
Two bacteriophage HEs have been previously crystallized and studied biochemically in great depth: the GIY-YIG endonuclease I-TevI (which drives intron homing into a thymidylate synthase host gene in T4 bacteriophage) (30) and the HNH endonuclease I-HmuI (which drives intron homing into a DNA polymerase host gene in the Bacillus SPO1 bacteriophage) (31). Both of those enzymes, as well as their closest homologs (I-BmoI and I-BasI, respectively) appear to bind their DNA targets as monomers (9,10), with protected DNA regions extending ~30–40bp downstream from their intron insertion site. These enzymes discriminate between intron-plus and intron-minus alleles of their host genes through a small number of sequence-specific interactions near the site of cleavage. Whereas I-HmuI acts as a strict monomer to nick its DNA target near its intron insertion site (apparently relying upon subsequent conversion of the nick to a DSB to promote homing) (10), I-TevI is observed to directly generate a DSB and a two base, 5′ overhang 23- and 25-bp upstream of the intron insertion site (9). The ability of I-TevI to directly generate a DSB may require transient dimerization of catalytic domains at the site of DNA cleavage; however, this behavior has not been demonstrated directly.
In contrast, I-Bth0305I forms a stable dimer in the absence of DNA, contacts up to 60bp of DNA and cleaves a pseudo-palindromic target in the RecA host gene. If each individual subunit of the I-Bth0305I homodimer contacted a length of DNA target that was similar to the monomeric I-TevI and I-HmuI subunits, then the observed 60-bp contact region would simply correspond to two 30-bp DNA half-sites. The homodimeric architecture of I-Bth0305I (in the absence of bound DNA) may predispose the enzyme to recognize and cleave target sites that display greater palindromic symmetry than has been observed for enzymes that initially bind their DNA targets as monomers.
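The notion of palindromic symmetry used above can be made concrete by comparing a sequence with its own reverse complement. The following is a minimal sketch, not the analysis used in this study: the function names are assumptions, and the example sequences are illustrative only (GAATTC is the well-known EcoRI site; GAATTA is a hypothetical variant). The I-Bth0305I target sequence itself is not reproduced here.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def palindromic_matches(seq):
    """Count positions at which a sequence matches its own reverse complement."""
    seq = seq.upper()
    return sum(1 for a, b in zip(seq, reverse_complement(seq)) if a == b)


def symmetry_fraction(seq):
    """Fraction of positions that obey palindromic symmetry (1.0 = perfect)."""
    return palindromic_matches(seq) / len(seq)


perfect = "GAATTC"    # perfectly palindromic
imperfect = "GAATTA"  # hypothetical pseudopalindrome: outermost pair is broken

print(symmetry_fraction(perfect))
print(palindromic_matches(imperfect), "of", len(imperfect))
```

For an even-length site, each broken symmetric pair removes two matching positions, so a pseudopalindromic target scores below 1.0 in steps of two positions.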
The I-Bth0305I endonuclease displays a bipartite, multidomain architecture and harbors a catalytic domain that is fused to a predicted DNA-binding region that contains two NUMOD sequence elements, which likely bind specific DNA sequences using an HTH motif. The conclusion that can be drawn from all of the experiments in this study is that the enzyme homodimerizes through interactions between nuclease domains, and that interactions of those domains with the DNA generate the majority of target site specificity at the central 14bp of the target. The remainder of the protein–DNA contacts, made at positions outside of this central pseudopalindromic region, are largely nonspecific and presumably made by the N-terminal DNA-binding regions that contain the NUMOD motifs (Figure 8a).
A similar bipartite domain organization has previously been observed in I-HmuI, I-TevI and their homologs (6–8). However, the domain organization of this new homing endonuclease family (containing an N-terminal DNA-binding domain fused to a C-terminal nuclease domain) is reversed as compared to those previously characterized bacteriophage endonucleases, and involves an entirely different nuclease core structure. Together, these differences suggest a distinct evolutionary history for this bacteriophage-specific homing endonuclease lineage.
The specificity profile displayed by I-Bth0305I is unusual as compared to other well-studied, phage-derived homing endonucleases, in that almost all sequence specificity of cleavage appears to be focused near the site of cleavage, with relatively little specificity derived from contacts between the HE and more distal positions in the DNA recognition site. In contrast, the HNH and GIY-YIG endonucleases appear to display bipartite recognition patterns, with limited numbers of sequence-specific contacts made both by the nuclease domains near the sites of DNA strand cleavage and by the more distant DNA-binding regions of the enzyme. However, close examination of the sequence specificity profiles of enzymes such as I-TevI (a GIY-YIG enzyme) (9) and I-HmuI (an HNH enzyme) (10) indicates that the base pair identities in their target sites that are most critical for recognition and cleavage are also located near the site of cleavage, and are generally bases that are particularly well conserved within the coding sequence of the target host gene. This feature of DNA specificity is displayed by virtually all known families of homing endonucleases (26–29).
The specificity profile of I-Bth0305I suggests that several of the central 14bp surrounding the intron insertion site are most specifically recognized by the enzyme and therefore might be a functionally important region of the RecA host gene. To investigate this hypothesis, we examined the conservation of the RecA-coding DNA and translated protein sequences corresponding to the endonuclease target region, by generating a multi-sequence alignment of 1368 recA genes, including the 0305ϕ8–36 bacteriophage gene without its intron (Figure 8b). The conservation of the positions in the coding sequence and protein multiple alignments was calculated using information theory measures and taking into account background frequencies of amino acids, and differing similarities between the aligned regions (20,32). This analysis demonstrates strong conservation at 11 out of the central 14bp of the endonuclease target site, and additional, stronger conservation of the DNA and protein sequence downstream of the intron-insertion site.
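One common information-theory measure of column conservation is the relative entropy of the observed symbol frequencies against a background distribution. The sketch below illustrates that idea only; it is not necessarily the measure used in the cited methods (20,32), the function names are assumptions, and the four-sequence alignment is a hypothetical toy example rather than data from the 1368-gene alignment.

```python
import math
from collections import Counter


def column_information(column, background):
    """Relative entropy (bits) of one alignment column against a background.

    Symbols absent from the background (gaps, ambiguity codes) are skipped.
    """
    symbols = [s for s in column if s in background]
    if not symbols:
        return 0.0
    total = len(symbols)
    info = 0.0
    for symbol, count in Counter(symbols).items():
        p = count / total
        info += p * math.log2(p / background[symbol])
    return info


def alignment_information(sequences, background):
    """Per-column relative entropy for equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("aligned sequences must have equal length")
    return [
        column_information([s[i] for s in sequences], background)
        for i in range(length)
    ]


# Assumption: uniform nucleotide background; a real analysis would use
# background frequencies estimated from the data.
UNIFORM_DNA = {base: 0.25 for base in "ACGT"}

toy_alignment = ["GGTGAT", "GGCGAT", "GGAGAC", "GGTGAT"]  # hypothetical
profile = alignment_information(toy_alignment, UNIFORM_DNA)

for position, bits in enumerate(profile, start=1):
    print(f"column {position}: {bits:.2f} bits")
```

With a uniform DNA background, an invariant column scores the maximum of 2 bits, whereas a column that varies freely (such as a degenerate third codon position) scores closer to zero.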
The amino acid sequence of the bacteriophage RecA protein corresponding to the 20 residues that are encoded by the DNA region that is contacted by I-Bth0305I is somewhat diverged from the overall RecA consensus. Nine of those residues from the bacteriophage protein correspond to the top residue in the RecA protein logo plot, three of which (F224, G225 and P227) are encoded within the central region of the target site. The specificity profile of I-Bth0305I is somewhat correlated with the RecA-coding sequence in that region: base pair positions that are recognized by the enzyme with above average preference include the first two positions of the codons encoding G225, D226 and P227 (Figure 2). A similar observation, that the specificity of a homing endonuclease can be correlated to the reading frame and coding degeneracy within its host gene target site, has been reported for several homing endonucleases, including the I-AniI protein in Aspergillus nidulans (28).
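The link between coding degeneracy and codon position can be illustrated by listing which bases are permitted at each codon position for a given amino acid. This is a minimal sketch that assumes the standard genetic code; the function name is hypothetical, and the example is not part of the original analysis.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def bases_by_position(amino_acid):
    """For each of the three codon positions, list the bases that occur
    among all codons encoding the given amino acid (one-letter code)."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    return [sorted({c[i] for c in codons}) for i in range(3)]


for residue in "GDP":  # glycine, aspartate, proline
    print(residue, bases_by_position(residue))
```

For glycine, aspartate and proline the first two codon positions are each fixed to a single base, while the third position tolerates two or four different bases. Under this assumption, an enzyme that reads the first two positions of such codons targets the bases that must be retained whenever the encoded residue is retained.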
The amino acid sequence encoded by the central region of the endonuclease target site spans the functionally critical ‘L2’ region of the RecA protein (Figure 8c). RecA forms helical filaments composed of multiple RecA monomers bound to single-stranded DNA (PDB ID 3cmt). When the L2 region is examined in the context of these filaments, its residues are observed to form a β-hairpin structure that contacts the DNA backbone and at least one nucleotide base (33) (Figure 8d). Regions corresponding to L2 are also found in eukaryotic and archaeal RadA/Dmc1 proteins and bacterial DnaA proteins, both of which have similar DNA-binding activities (22). The L2 loop has previously been shown to be an insertion site for invasive inteins in several bacteria, including the recA gene of Mycobacterium leprae (34).
Homing endonucleases share common evolutionary ancestors with a wide variety of host proteins that are responsible for an equally broad range of biological functions. For example, a large bacterial superfamily (the DUF199 proteins) that is thought to be involved in transcriptional activation of genes involved in sporulation or other differentiation and growth processes has been shown to contain LAGLIDADG domains (35). The HNH catalytic motif is found in non-specific bacterial and fungal nucleases (36,37), and is also found in a wide range of DNA-acting enzymes including transposases, restriction endonucleases, polymerase editing domains and DNA packaging factors (38,39). The GIY-YIG catalytic motif is found in several bacterial restriction enzymes (such as Eco29kI) (40) and enzymes involved in DNA repair and recombination (such as the UvrC nucleotide-excision repair endonucleases) (41). Finally, the bacterial homing endonuclease I-Ssp6803I is a PD…(D/E)xK endonuclease, which is the most common catalytic protein fold in type II restriction endonuclease systems (5).
The discovery of a new homing endonuclease lineage (11) as characterized in this study again illustrates an evolutionary relationship between modern-day homing endonucleases and distantly related bacterial proteins (in this case, between a bacteriophage-derived homing endonuclease and a DNA mismatch repair enzyme). The ‘PD…(D/E)xK’ motif observed in these proteins (SCOP family 3.72.1) has been greatly diversified during evolution, facilitating its use for many biological functions (42). It has been visualized many times in restriction endonucleases, as well as in a variety of other contexts, including tRNA-specific homing endonucleases and a variety of DNA repair enzymes. All known variants of this fold display at least two acidic residues, and usually at least one additional basic residue in the nuclease active site, forming the catalytic motif that catalyzes phosphoryl transfer reactions (43).
Vsr endonucleases (and presumably I-Bth0305I) display a type II restriction enzyme topology that has significantly diverged from the canonical ‘PD…(D/E)xK’ motif, including the use of an activated histidine as a general base (14). The I-Bth0305I homing endonuclease and its nearest cousins appear to have maintained most of the features of this unique active site arrangement, although at least one additional strongly conserved acidic residue in the active site region (at the position corresponding to Phe62 in Vsr) may indicate further subtle divergence in catalytic mechanism.
Finally, the predicted bipartite structure of the homing endonuclease described in this study raises the possibility that the nuclease domain, on its own, might offer a useful catalytic fold for artificial gene targeting nucleases. This technology involves the creation of artificial nucleases by appending a non-specific nuclease domain (almost always the catalytic domain of the FokI restriction endonuclease) to a DNA recognition and binding construct consisting of a tandem array of zinc fingers or TAL repeats (44,45). The isolation and characterization of an independently folded nuclease domain that (i) appears to display a moderate degree of sequence specificity directly at the site of cleavage and (ii) naturally dimerizes prior to DNA binding may allow the development of new types of gene targeting proteins with novel DNA cleavage properties that prove useful for certain biotechnology and genome engineering applications.
Structure factor amplitudes and refined coordinates for the catalytic domain of I-Bth0305I have been deposited at the RCSB Protein Data Bank under accession code 3R3P and designated for immediate release upon publication.
Supplementary Data are available at NAR Online.
National Institutes of Health research (grant R01 GM49857 to B.L.S.); National Institutes of Health training grant appointment (T32 GM08268 to G.K.T.); Hermann and Lilly Schilling Foundation chair (to S.P.). Funding for open access charge: National Institutes of Health (grant R01 GM49857).
Conflict of interest statement. One of the authors (B.L.S.) is a founder of a startup company that conducts research on homing endonuclease and gene targeting proteins. The protein described in this study is the subject of a recent patent application for construction of novel gene targeting protein scaffolds.
The authors thank members of the Stoddard lab (particularly Ryo Takeuchi and Brett Kaiser) and Geoff Wilson at New England Biolabs for invaluable advice and assistance on this project.