Homing endonucleases are proteins that drive the dominant, non-Mendelian inheritance of their own reading frames by catalyzing a double-strand break (DSB) at specific DNA target sites in a recipient genome (1). The DSB is repaired via homologous recombination, using an allele of the target gene that contains the homing endonuclease gene (HEG) as a repair template; this copies the HEG into the site of DNA cleavage. HEGs are often embedded within self-splicing introns or inteins. The inclusion of a self-splicing genetic element as part of the mobile DNA allows invasion of highly conserved regions in crucial host genes without disrupting their essential functions. The coevolution of a homing endonuclease, its surrounding intron or intein, and the host gene results in an intricate network of genetic and physical interactions that affect the expression, specificity and invasiveness of the mobile element (2).
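The consequence of this copying step for inheritance can be illustrated with a toy calculation. The sketch below is not taken from this work: it assumes an idealized diploid, randomly mating population with no fitness cost, which is a textbook simplification and not a description of the phage, organellar or microbial systems discussed here. The names `next_frequency`, `simulate` and `homing_rate` are hypothetical. A heterozygote transmits the HEG-containing allele to a fraction (1 + c)/2 of its gametes, where c is the assumed probability that the HEG-lacking allele is cleaved and converted, so the allele frequency p changes as p' = p + c·p·(1 − p).

```python
def next_frequency(p: float, homing_rate: float) -> float:
    """Frequency of the HEG-containing allele after one round of random mating.

    Assumptions (illustrative only): diploid, random mating, no fitness cost.
    homing_rate is the hypothetical probability that the HEG-lacking allele
    in a heterozygote is cleaved and repaired from the HEG-containing allele.
    """
    homozygote_contribution = p * p
    # Heterozygotes transmit the HEG in (1 + homing_rate) / 2 of gametes
    # instead of the Mendelian 1/2.
    heterozygote_contribution = 2 * p * (1 - p) * (1 + homing_rate) / 2
    return homozygote_contribution + heterozygote_contribution


def simulate(p0: float, homing_rate: float, generations: int) -> list[float]:
    """Return the allele frequency trajectory, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], homing_rate))
    return freqs


if __name__ == "__main__":
    # Mendelian case (no homing) versus an arbitrary homing rate of 0.9.
    for rate in (0.0, 0.9):
        trajectory = simulate(p0=0.01, homing_rate=rate, generations=10)
        print(rate, [round(f, 3) for f in trajectory])
```

With `homing_rate` set to zero the frequency stays constant, as expected for Mendelian transmission. Any positive value makes the allele increase every generation under these assumptions, which is the sense in which inheritance is "dominant" and non-Mendelian.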
To succeed as mobile genetic elements, homing endonucleases must balance the requirement for high DNA cleavage specificity (to avoid host toxicity) against the need for reduced fidelity at various base pairs in their target site (to facilitate genetic mobility in the face of sequence drift within potential DNA target sites). Homing endonucleases and associated mobile introns and inteins that have successfully achieved this balance are encoded in the genomes of bacteria, in organelles of fungi and algae, in single-celled protists, and in the bacteriophage and viruses that accompany and infect those organisms.
There are five well-characterized families of homing endonucleases, each classified according to its unique protein fold, distinct catalytic active site and DNA cleavage mechanism (1). Members of the ‘LAGLIDADG’ family, so named on the basis of their most conserved protein motif, are found in eukaryotic organellar and archaeal genomes, and are the most specific of the known homing endonucleases (3). They exist both as homodimers that are limited to recognition of palindromic and near-palindromic target sites, and as pseudosymmetric monomers (where two structurally similar domains are tethered together on a single protein chain) that can recognize completely asymmetric target sites. Members of the ‘His-Cys box’ and the ‘PD…(D/E)-xK’ families (found in protists and in cyanobacteria, respectively) also form multimeric protein complexes that recognize symmetric target sequences (4). In contrast, members of the HNH and GIY-YIG families (usually found in bacteriophage) display multidomain structures (corresponding to separate DNA binding and catalytic regions) and adopt highly elongated conformations when bound to DNA (6–8). As a result, those proteins usually recognize long non-palindromic sequences with significantly reduced fidelity (9).
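The distinction between palindromic, near-palindromic and asymmetric target sites can be made concrete with a short sketch. It is an illustration only: the function names and the `max_mismatches` threshold are hypothetical, and the example sequences are toy inputs (GAATTC is the familiar EcoRI restriction site), not homing endonuclease target sites. A site is treated as palindromic when each base in the left half pairs with the complement of the mirrored base in the right half.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def symmetry_mismatches(site: str) -> int:
    """Count mirrored position pairs that break reverse-complement symmetry.

    The central base of an odd-length site has no partner and is ignored.
    """
    site = site.upper()
    n = len(site)
    return sum(
        site[i] != site[n - 1 - i].translate(COMPLEMENT)
        for i in range(n // 2)
    )


def classify_site(site: str, max_mismatches: int = 2) -> str:
    """Label a site by symmetry; max_mismatches is an arbitrary threshold."""
    mismatches = symmetry_mismatches(site)
    if mismatches == 0:
        return "palindromic"
    if mismatches <= max_mismatches:
        return "near-palindromic"
    return "asymmetric"


if __name__ == "__main__":
    for toy_site in ("GAATTC", "GAATTG", "AAAAAAAA"):
        print(toy_site, symmetry_mismatches(toy_site), classify_site(toy_site))
```

A homodimer presents two identical DNA-binding surfaces, so it would be expected to favor sites with few such mismatches. A single-chain monomer with two different domains is not constrained in this way.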
Recently, a novel type of fractured gene structure, containing separately encoded halves of self-splicing inteins that interrupt individual host genes in the same locus, was discovered during an analysis of environmental metagenomic sequence data collected by the Global Ocean Sampling (GOS) project (11). These split intein sequences are found in a diverse set of host genes that are primarily involved in DNA synthesis and repair. The inteins are themselves often interrupted either by open reading frames (ORFs) that encode members of the GIY-YIG homing endonuclease family, or by novel ORFs that do not exhibit significant sequence similarity to previously characterized homing endonuclease families. Homologs of those uncharacterized ORFs were also found associated with introns or as free-standing genes. In total, 15 members of the newly discovered gene family were described, including two within previously annotated recA genes in the NCBI sequence database.
The C-terminal region of this newly identified protein family displays limited sequence homology [typically corresponding to e-values from a BLASTP (12)] to the catalytic domain of the very short patch repair (‘Vsr’) endonucleases (enzymes that generate a 5′ nick at T:G mismatches in newly replicated DNA and thus stimulate DNA nucleotide excision repair) (13). Several catalytic residues from Vsr endonucleases are conserved across all members of the new gene family, and form the composite sequence motif EDxHD. These residues include an essential aspartate that coordinates a catalytic magnesium ion, a histidine believed to act as a general base and a neighboring aspartate residue. Because these intron- and intein-associated microbial ORFs contain a recognizable endonuclease catalytic domain, and because the catalytic residues within that domain are conserved, this gene family was hypothesized to encode a novel lineage of homing endonucleases.
These ORFs also display sequence signatures in their N-terminal regions that are similar to those found in several nuclease-associated modular DNA-binding motifs (‘NUMODs’) (15). NUMODs are frequently found in other homing endonucleases from bacteriophage, such as the GIY-YIG endonuclease I-TevI (8) and the HNH endonuclease I-HmuI (6). In those cases, the NUMODs are found at the C-terminal end of the protein (a reversed domain organization compared to the metagenomic ORFs described above). The extended conformation that NUMOD regions adopt upon DNA binding dictates that they make relatively sparse contacts across their long target sites.
A representative member of this novel homing endonuclease family, which we have named I-Bth0305I, was identified in the NCBI sequence database during the same genomic analysis (11). This ORF is located within a group I intron that interrupts the recA gene of Bacillus thuringiensis 0305ϕ8–36 bacteriophage. The experiments reported in this manuscript define the binding site, cleavage pattern and specificity of I-Bth0305I, and the crystal structure of its catalytic domain. These experiments demonstrate that I-Bth0305I is a site-specific endonuclease that forms a homodimer and contacts a region of DNA up to 60 bp in length. Many bacteriophage homing endonucleases tether relatively nonspecific catalytic nuclease domains to sequence-specific DNA-binding domains, and therefore display significant specificity for DNA base pairs that are located some distance from the site of cleavage. In contrast, I-Bth0305I displays its greatest specificity across the central base pairs of its recognition site (spanning the positions of DNA cleavage and intron insertion), and little additional sequence specificity at positions more distant from the cleavage site. The crystal structure of the I-Bth0305I catalytic domain confirms that members of this putative homing endonuclease family share a common ancestor with the Vsr mismatch repair endonuclease, and supports a similar mechanism for DNA strand cleavage.