TAL (transcription activator–like) effectors (TALEs) are major virulence factors secreted by bacteria of the genus Xanthomonas
that cause diseases in plants such as rice and cotton (1
). TALEs, also known as AvrBs3/PthA family effectors (5
), are injected into plant cells through a type III secretion system and interfere with cellular activities through transcriptional activation of specific target genes (1
). TALEs share a common domain organization that enables them to be imported into nuclei and to act as transcriptional activators (10).
The central DNA binding domain of TALEs consists of 1.5 to 33.5 tandem repeats (TAL repeats), with each repeat recognizing one specific DNA base pair (14
). Each TAL repeat contains 33 to 35, mostly 34, highly conserved amino acids (16
). Within each repeat, two hypervariable residues at positions 12 and 13 (also known as RVDs for repeat variable diresidues) confer DNA specificity (14
). The code of DNA recognition by RVDs has been deciphered by both experimental (14
) and computational (15
) approaches. The frequently occurring RVDs His/Asp (HD), Asn/Gly (NG), and Asn/Ile (NI) recognize cytosine (C), thymine (T), and adenine (A), respectively (1
). DNA binding by TAL repeats is modular, allowing engineering of DNA-binding proteins by assembly of TAL repeats with designed RVDs, for example, for use in targeted gene activation (14
). Despite these advances, how TAL repeats specifically recognize DNA remains unknown.
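The modular code described above lends itself to a simple lookup. The following is a minimal sketch, assuming only the three RVD-to-base pairings quoted in the text (HD for C, NG for T, NI for A); the function name `design_rvds` and the example target are hypothetical, and guanine is deliberately left without an entry because no RVD for it is named at this point.

```python
# Base -> RVD choices taken from the pairings quoted in the text
# (HD recognizes C, NG recognizes T, NI recognizes A).
# Guanine has no entry on purpose: the passage names no RVD for it.
BASE_TO_RVD = {"C": "HD", "T": "NG", "A": "NI"}


def design_rvds(target):
    """Return one RVD per base of `target`, read 5' to 3' (hypothetical helper)."""
    rvds = []
    for position, base in enumerate(target.upper(), start=1):
        if base not in BASE_TO_RVD:
            raise ValueError(
                "no RVD listed for base %r at position %d" % (base, position)
            )
        rvds.append(BASE_TO_RVD[base])
    return rvds


if __name__ == "__main__":
    # Arbitrary illustrative target, not a sequence from the study.
    print("-".join(design_rvds("TACCTA")))
```

Each base maps to exactly one repeat, which is what makes the assembly of designed repeat arrays straightforward in principle; real designs would also have to account for context effects that this sketch ignores.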
We investigated an artificially engineered TAL effector, dHax3 (20
) (fig. S1
). The central domain of dHax3 (residues 270 to 703), containing 11.5 TAL repeats, was crystallized in the space group C222₁
). The structure was determined by Ta6
-based multiwavelength anomalous diffraction and refined to 2.4 Å resolution (tables S1 and S2 and fig. S2A
). There is one molecule in each asymmetric unit. In the crystals, symmetry-related molecules are arranged to form a continuous right-handed, superhelical assembly (fig. S2B
). The structurally well-defined region of DNA-free dHax3 (residues 303 to 675) forms exactly 11 repeats, starting from the second half of repeat 1 and ending at repeat 11.5. The superhelical assembly has an external diameter of about 60 Å.
Fig. 1 Structure of the TAL repeats in DNA-free dHax3. (A) The 11 TAL repeats of dHax3 form a right-handed superhelical assembly. Two perpendicular views are presented with the RVDs highlighted in red in the right image. (B) All TAL repeats exhibit a nearly ...
Each TAL repeat in dHax3 contains 34 amino acids, with residues 3 to 11 forming a short α helix (designated as “a”) and residues 15 to 33 constituting an extended, bent α helix (designated as “b”). The two helices are connected by a short loop consisting of the RVD and an invariant Gly at position 14 (fig. S1
). This loop is hereafter referred to as the RVD loop. Reflecting the high degree of sequence conservation (fig. S1
), all 11 repeats exhibit a nearly identical conformation (fig. S2C
). Helices a and b within each repeat closely stack against each other through extensive van der Waals contacts. The angle between the helices distinguishes the TAL repeat from other known α-helical repeat modules such as HEAT (22
) and TPR (23
), in which the two helices are nearly parallel to each other. A nuclear magnetic resonance (NMR) structure of 1.5 TAL repeats in the protein PthA was previously reported (24
); however, our TAL repeat structure exhibits major differences from that in PthA (fig. S2, D and E).
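The residue ranges given above for a single repeat can be written down as a small data structure. This is only a sketch: the segment boundaries come from the text, whereas the function name `split_repeat` and the 34-residue example sequence are illustrative assumptions (the actual dHax3 repeat sequences are in fig. S1, which is not reproduced here).

```python
# Segment boundaries (1-based, inclusive) as described for a 34-residue repeat.
SEGMENTS = {
    "helix_a": (3, 11),    # short alpha helix "a"
    "rvd_loop": (12, 14),  # RVD (positions 12-13) plus the invariant Gly14
    "helix_b": (15, 33),   # extended, bent alpha helix "b"
}

REPEAT_LENGTH = 34

# Illustrative placeholder sequence, assumed for this example only.
EXAMPLE_REPEAT = "LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG"


def split_repeat(repeat_seq):
    """Slice one repeat into helix a, the RVD loop, helix b, and the RVD itself."""
    if len(repeat_seq) != REPEAT_LENGTH:
        raise ValueError("expected a %d-residue repeat" % REPEAT_LENGTH)
    parts = {
        name: repeat_seq[start - 1:end] for name, (start, end) in SEGMENTS.items()
    }
    parts["rvd"] = repeat_seq[11:13]  # positions 12 and 13
    return parts


if __name__ == "__main__":
    for name, segment in split_repeat(EXAMPLE_REPEAT).items():
        print(name, segment)
```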
The 11 TAL repeats of dHax3 complete a full helical turn; the RVD loops form the innermost spiral with a pitch of 60 Å per turn. The 11 a helices form an internal layer along the superhelical axis, whereas the 11 b helices constitute an external layer. These structural features suggest a DNA-binding model in which the DNA molecule is placed within the TAL superhelical assembly along the axis.
We crystallized a binary complex between dHax3 (residues 231 to 720), which encompasses the entire 11.5 TAL repeats, and a 17–base pair (bp) DNA binding element (20
), with 5′-TGTCCCTTTATCTCTCT-3′ as the sense strand. The structure was determined by molecular replacement at 1.85 Å resolution (table S2 and fig. S3A
). There are two complexes in each asymmetric unit (fig. S3B
). The two protein molecules (designated A and B) can be superimposed with a root mean square deviation (RMSD) of 1.2 Å over 447 Cα atoms (fig. S3C
). Because the two complexes exhibit identical features for most repeats, we mainly describe the structural analysis of molecule A.
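For readers unfamiliar with the RMSD figure quoted above, the quantity can be sketched as follows. This is a minimal illustration that assumes the two coordinate sets are already superimposed and paired atom by atom; it does not perform the optimal superposition that structural alignment software carries out first, and the toy coordinates are invented for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation (same units as the input) over paired atoms.

    Assumes both lists hold (x, y, z) tuples in matching order and that the
    structures have already been superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates (in Å), not taken from either dHax3 molecule.
    molecule_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    molecule_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(molecule_a, molecule_b))
```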
In the complex structure, dHax3 comprises 12 repeats (residues 289 to 691), with the C-terminal 0.5 repeat contributed by nonconserved amino acids. These repeats are capped by three and two short α helices at the N and C termini, respectively. Similar to DNA-free dHax3, all repeats exhibit a nearly identical conformation except RVD loops in repeat 6 of molecule A and repeat 5 of molecule B (figs. S3D
). The superhelical dHax3 structure tracks along the major groove of the DNA duplex. The conformation of the 17-bp DNA is largely B-form (table S3
), with 11 base pairs per turn and a pitch of about 35 Å.
Fig. 2 Overall structure of dHax3 bound to DNA. The superhelical structure of dHax3 (residues 231 to 720) binds to the major groove of DNA. Shown on the right are the DNA sequence of the sense strand and the corresponding RVDs in TAL repeats of dHax3. dHax3 ...
In the structures of both DNA-free and DNA-bound dHax3, there are 11 TAL repeats per helical turn. Comparison of any corresponding repeat between these two structures reveals little difference, with an RMSD of 0.25 to 0.34 Å over about 30 Cα atoms. However, the superhelical pitch is reduced from 60 Å in the DNA-free form to about 35 Å in the DNA-bound form. Whereas the main chains of the first 22 amino acids are precisely superimposed, subtle conformational variations accumulate for residues 23 to 34, resulting in notable differences between the positions of the Cα atoms in Gly34.
Such differences are gradually amplified over an increasing number of repeats (fig. S5
), ultimately resulting in the compression of the superhelical assembly in the DNA-bound form. Such conformational plasticity is consistent with the predominantly van der Waals interactions between adjacent TAL repeats, which can tolerate minor distance shifts (fig. S6
). The ability to undergo substantial conformational changes appears to be a conserved feature for superhelical assemblies exemplified by Armadillo repeats in β-catenin (25
) and HEAT repeats in karyopherin β (26
) and the scaffold subunit of protein phosphatase 2A (PP2A) (27
). The conformational plasticity of the TAL repeats, which was previously noted (24
), is likely essential for the function of TALEs.
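The compression described above can be put in per-repeat terms with simple arithmetic. The sketch below treats the assembly as an ideal, regular superhelix, which is an assumption made for illustration; the pitch values and the count of 11 repeats per turn are the figures quoted in the text, and the function names are hypothetical.

```python
REPEATS_PER_TURN = 11    # TAL repeats per superhelical turn (both forms)
PITCH_DNA_FREE = 60.0    # Å, DNA-free dHax3
PITCH_DNA_BOUND = 35.0   # Å, DNA-bound dHax3 (approximate)


def rise_per_repeat(pitch, repeats_per_turn=REPEATS_PER_TURN):
    """Axial advance per repeat, in Å, for an ideal regular superhelix."""
    return pitch / repeats_per_turn


def rotation_per_repeat(repeats_per_turn=REPEATS_PER_TURN):
    """Rotation about the superhelical axis per repeat, in degrees."""
    return 360.0 / repeats_per_turn


if __name__ == "__main__":
    for label, pitch in (("DNA-free", PITCH_DNA_FREE), ("DNA-bound", PITCH_DNA_BOUND)):
        print("%s: %.2f Å per repeat" % (label, rise_per_repeat(pitch)))
    print("rotation: %.1f degrees per repeat" % rotation_per_repeat())
```

Under these assumptions the rise per repeat drops from roughly 5.5 Å to roughly 3.2 Å, and the latter equals the rise per base pair implied by 11 base pairs per turn and a pitch of about 35 Å, consistent with one repeat tracking one base pair.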
Fig. 3 Structural comparison of DNA-free and DNA-bound TAL repeats in dHax3. (A) DNA-free and DNA-bound dHax3 are shown for residues 323 to 675, which comprise TAL repeats 2 to 11. The two structures are superimposed by using the N-terminal 23 amino acids, which ...
Analysis of the electrostatic surface potential reveals a stripe of positively charged amino acids along the inner ridge of the dHax3 superhelical assembly (fig. S7A
). Each phosphate group in the sense strand of the DNA duplex is accommodated in a shallow surface pocket along the basic stripe (Fig. 4A, left). Lys16
, which are located at the beginning of helix b in each repeat, contribute to the positive electrostatic potential for interaction with the negatively charged phosphate (Fig. 4A, right). Interaction with the phosphate group of the DNA duplex, invariant among repeats 1 through 11, is mediated by hydrogen bonds (fig. S7, B and C).
Fig. 4 DNA recognition by TAL repeats. (A) The phosphate groups of the DNA sense strand are embraced by the positively charged ridge of the dHax3 TAL repeats. The surface electrostatic potential was calculated with PyMOL (30) (left). The invariant residues Lys ...
The two hypervariable residues in the RVD loops, positioned in close proximity to the sense strand in the DNA major groove, play different biochemical roles. Residue 12, either His or Asn in the 11.5 TAL repeats of dHax3 (fig. S1
), does not directly contact DNA. Instead, the side chains of His12
point away from DNA bases, each making a direct H bond to the carbonyl oxygen atom of Ala8
, which is invariant and located at the C-terminal end of helix a in each TAL repeat. Thus, the primary role of residue 12 in TAL repeats is not to directly recognize DNA but to stabilize the local conformation of the RVD loops. Supporting this analysis, there is a water-mediated H bond between the imidazole group of His12
in TAL repeat 1 and the carboxylate oxygen atom of Asp13
in repeat 2. An identical interaction is observed between His12
of repeat 2 and Asp13
of repeat 3. These structural findings demonstrate that His12
contributes indirectly to DNA binding by stabilizing the proper conformation of the RVD loops, which allows residue 13 to specifically recognize DNA bases.
Among the more than 20 codes identified for DNA recognition by TALE RVDs, some are more frequently observed than others (1
). The TAL repeats in dHax3 use three codes, in which the two hypervariable residues HD, NG, and NS specifically recognize the DNA bases C, T, and A, respectively (20
). These three codes account for about half of all cases reported (1
). The structure of DNA-bound dHax3 provides a satisfying explanation for these codes. In the case of HD→C, the carboxylate oxygen atom of Asp13
accepts an H bond from the amine group of cytosine in TAL repeats 1 to 3, 9, and 11. In the case of NS→A, the hydroxyl group of Ser13
in TAL repeat 7 donates an H bond to the N7 atom of adenine. Compared with HD, NS is nonselective in that it can recognize all four bases (14
). Similar to adenine, guanine also contains an N7 atom, which is likely recognized by Ser13
in the same manner. Recognition of cytosine or thymine may require a slightly different conformation of the RVD loop, a scenario awaiting further structural evidence.
The correlation between NG and the base T is intriguing. Instead of providing any specific interaction, the placement of Gly at position 13 allows sufficient space to accommodate the 5-methyl group of thymine. In TAL repeats 4, 8, 10, and 12, the distance between the Cα of Gly13
and the 5-methyl group of thymine is between 3.4 and 3.7 Å, allowing van der Waals interaction. Substitution of Gly with any other residue would likely introduce steric clash with the 5-methyl group of thymine, providing a structural explanation for the observation that recognition of the base T usually requires Gly at position 13 (1
). However, in repeats 5 of molecule A and 6 of molecule B, the distance between Gly-Cα and the 5-methyl group of thymine is more than 5 Å. We speculate that mutation of Gly13
to an amino acid with a short side chain may be tolerated here.
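The three codes discussed above can be checked against the crystallized DNA with a short script. This is a sketch under stated assumptions: the RVD list for repeats 1 to 12 is assembled from the repeat numbers given in the text, with repeats 5 and 6 inferred to be NG from the Gly13 discussion, and NS is mapped to A only because that is the pairing observed in this structure. The names `DHAX3_RVDS` and `decode` are hypothetical.

```python
# RVD -> base pairings as observed in the dHax3 complex described in the text.
# NS is reported to be nonselective; it is mapped to A here only because
# that is the pairing seen in repeat 7 of this structure.
RVD_TO_BASE = {"HD": "C", "NG": "T", "NS": "A"}

# RVDs of repeats 1..12, assembled from the repeat numbers in the text
# (repeats 5 and 6 are inferred to be NG; this ordering is an assumption).
DHAX3_RVDS = ["HD", "HD", "HD", "NG", "NG", "NG",
              "NS", "NG", "HD", "NG", "HD", "NG"]

# Sense strand of the 17-bp DNA element used for crystallization.
SENSE_STRAND = "TGTCCCTTTATCTCTCT"


def decode(rvds):
    """Translate a list of RVDs into the base sequence they are expected to read."""
    return "".join(RVD_TO_BASE[rvd] for rvd in rvds)


if __name__ == "__main__":
    predicted = decode(DHAX3_RVDS)
    offset = SENSE_STRAND.find(predicted)  # -1 would mean no match
    print("predicted target:", predicted)
    print("offset within the sense strand:", offset)
```

If the assumed RVD order is right, the decoded sequence should appear as a contiguous stretch of the sense strand, one base per repeat.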
Both the structure and the mode of DNA binding by the TAL repeats differ from those of other known DNA-binding domains such as the zinc-finger domain, the basic leucine zipper motif, and the helix-turn-helix motif (fig. S8
). The modular nature of the DNA-TAL repeats is also different from that of known RNA-binding proteins such as trp RNA-binding attenuation protein (TRAP) (28
). The closest entry from an exhaustive search of the Protein Data Bank (PDB) using DALI (29
) is the structure of DNA-bound MTERF1 (mitochondria transcription terminator 1) (fig. S8
), which also exhibits a superhelical conformation and has a Z score of 7.0 and an RMSD of 3.2 Å over 184 aligned Cα atoms with dHax3. However, the MTERF motif comprises two α helices and one 3₁₀-helix,
with considerable conformational variation among repeats. In addition, MTERF1 binding results in substantial unwinding of the DNA duplex (fig. S8).
Our structural investigation provides an explanation for about half of the frequently used codes for DNA recognition by TAL repeats. Among the remaining codes, how NI and NN recognize the bases A and G/A, respectively, remains to be elucidated. We suspect that the second Asn residue of NN may favor G/A through a specific H bond. Some of the less frequently used codes can also be explained by our available structural information. For example, the explanation for the code ND→C should be similar to that for HD→C, which was observed here. On the other hand, the rationalization for the code XG→T is likely the same as that for NG→T. Because 5-methylcytosine is similar to T, we suspect that XG might also be able to recognize 5-methylcytosine.
Our study represents a step toward comprehensive rationalization of sequence-specific DNA recognition by TAL repeats. Many questions remain. It is yet to be seen whether the arrangement of 11 repeats per turn is unique to dHax3 or a common feature of all TAL repeats. Although the base T is required for repeat “0” (14
), our structure of DNA-bound dHax3 does not provide an intuitive clue, because T at position zero is not particularly coordinated by either the N-terminal domain or the adjacent repeats (fig. S9
). Nonetheless, visualization of the modular, base-specific recognition by the TAL repeats may greatly facilitate rational design of novel DNA-binding proteins with a range of pragmatic applications.