Integrase (IN) is one of only three enzymes encoded in the genomes of all retroviruses, and the one least characterized in structural terms. IN catalyzes processing of the ends of a DNA copy of the retroviral genome and its concerted insertion into the chromosome of the host cell. The protein consists of three domains, the central catalytic core domain flanked by the N-terminal and C-terminal domains, the latter involved in DNA binding. Whereas the Protein Data Bank contains a number of NMR structures of the N- and C-terminal domains of human immunodeficiency viruses (HIV-1 and HIV-2), simian immunodeficiency virus, and avian sarcoma virus (ASV) IN, as well as X-ray structures of the core domain of HIV-1, ASV, and foamy virus IN, plus several models of two-domain constructs, no structure of the complete molecule of retroviral IN has been solved to date. Although no experimental structures of IN complexed with the DNA substrates are at hand, the catalytic mechanism of IN is well understood by analogy with other nucleotidyl transferases, and a variety of models of the oligomeric integration complexes have been proposed. In this review we present the current state of knowledge resulting from structural studies of IN from several retroviruses. We also attempt to reconcile the differences between the reported structures, and discuss the relationship between the structure and function of this enzyme, which is an important, although so far rather poorly exploited target for designing drugs against HIV-1 infection.
Although the existence of retroviruses and their ability to cause disease have been known for almost a century, it was the emergence of the acquired immunodeficiency syndrome (AIDS) in the early 1980s that provided a huge impetus to structural studies of their protein and nucleic acid components. Retroviruses, most notably human immunodeficiency virus type 1 (HIV-1), are enveloped in a glycoprotein coat and lack the high degree of internal and external symmetry that makes it possible to crystallize many relatively simple viruses, such as picornaviruses, exemplified by the viruses that cause the common cold and polio. It is thus unlikely that high-resolution information about the structural organization of intact retroviruses could be obtained with currently available methods such as crystallography, although significant progress in lower-resolution studies by electron microscopy has provided an excellent picture of the global aspects of their structure.
A typical retrovirus such as HIV-1 has been described as “Fifteen proteins and an RNA”. Three of these proteins are enzymes that are retrovirus-specific and are encoded by all retroviral genomes, although additional enzymes are found in some retroviruses. The structures of two of these enzymes, protease (PR) and reverse transcriptase (RT) [6,7], have been investigated in extensive detail during the last 20 years, using crystallography and NMR spectroscopy. A very large number of such structures, solved for both full-length apoenzymes and for complexes with substrates, products, effectors, and inhibitors, have been published [8-13]. The detailed structural knowledge, based on low-to-medium resolution structures of RT and medium-to-atomic resolution structures of PR, has been of considerable use in the design of clinically-relevant inhibitors of these enzymes [13,14]. At this time, 18 nucleoside and non-nucleoside inhibitors of RT, as well as 10 inhibitors of PR have been approved by the US Food and Drug Administration (FDA) for treatment of AIDS. By contrast, far less is known structurally about the third retroviral enzyme, integrase (IN), and fewer inhibitors of IN have been discovered so far. Only one of them, raltegravir, has recently gained FDA approval as an AIDS drug.
Although many anti-HIV drugs are already available, serious side effects and the emergence of drug-resistant mutations necessitate development of novel compounds. The current drugs targeting RT and PR cause significant side effects, including myopathy, hepatic steatitis, and lipodystrophy, which result from RT drugs alone or from a combination of RT and PR drugs. RT drugs block several mitochondrial proteins (DNA polymerase γ, uncoupling proteins), while PR drugs such as amprenavir or indinavir block the mechanistically unrelated enzyme, mitochondrial processing protease. Inhibitors of IN appear to be particularly promising [17-19] since, unlike PR and RT, this enzyme does not have direct human homologs. Although such inhibitors might still affect the function of other enzymes, such as RAG1/2 recombinase, they have not as yet been shown to cause pathological effects. The inhibitors currently in animal experiments or human clinical trials seem to be keeping this promise, showing fewer short-term side effects than FDA-approved anti-PR or anti-RT drugs. In consequence, drugs targeting IN might be given in higher, better tolerated doses, sufficient to fully block the enzyme from integrating viral DNA into the cell genome, thus allowing the host immune system to fight off the infection completely.
Whereas HIV-1 IN is clearly the most medically relevant IN and has been extensively investigated for over two decades, the enzyme encoded by avian sarcoma virus (ASV) was studied much earlier. In addition, enzymes from other retroviruses, including HIV-2, simian immunodeficiency virus (SIV), prototype foamy virus (PFV, previously known as human foamy virus), Mason-Pfizer monkey virus (M-PMV), and feline immunodeficiency virus (FIV), have been investigated as well. Although a significant amount of work was done with FIV IN, it will not be discussed further here since no crystals have been obtained. Similarly, we will not discuss M-PMV IN any further, since we are not aware of any advanced structural studies involving this protein.
As will be discussed later, no crystal structure of full-length IN is available at this time. However, many structures of fragments of this enzyme from several different viral sources have been solved by crystallography and NMR in the last 15 years (Table S1 in the Supplementary Material), including several important structures that have appeared since the last comprehensive review of this subject was published . These data will be discussed below.
In the present review, we focus predominantly on the structural aspects of retroviral integrases and not on the enzymatic mechanism and other functional features of these enzymes, which have been extensively reviewed earlier [24-27]. However, a short introduction about the basics of IN function is necessary to properly interpret the importance of various structural features.
The retroviral genomic RNA is reverse transcribed into a DNA copy by the previously mentioned retroviral enzyme, RT. The function of IN is to insert the resulting viral DNA into the host genome, with the reaction accomplished in two distinct steps (Fig. 1), both catalyzed by a triad of acidic residues in a characteristic D,D(35)E motif (two aspartates and a glutamate, the latter separated from the second aspartate by 35 residues) found in all retroviral INs. In the first, processing step, IN removes the two terminal nucleotides (GT in HIV-1, TT in ASV) from each 3′ end of the double-stranded viral DNA. The second step, called “joining” or “strand transfer”, involves a nucleophilic attack by the free 3′-hydroxyl of the viral DNA on the target chromosomal DNA, resulting in covalent joining of the two molecules. If the reaction is performed in a concerted manner, the second, coordinated insertion is made into the complementary strand of the target DNA, in a position 5 nucleotides away from the site of the first insertion (in HIV and SIV; 6 nucleotides in ASV). The subsequent removal of the two unpaired nucleotides at each 5′ overhanging end of the viral DNA and filling of the gaps are most likely performed by host enzymes.
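The spacing rule encoded in the D,D(35)E signature can be checked mechanically. The following Python sketch is an illustration only: the helper name `find_dd35e_candidates` and the toy sequence are hypothetical and are not taken from any real integrase. It looks for an aspartate followed, after exactly 35 intervening residues, by a glutamate, and confirms that the HIV-1 residue numbers used in this review (Asp116 and Glu152) satisfy that spacing.

```python
import re

def find_dd35e_candidates(sequence):
    """Return 1-based (Asp, Glu) positions with exactly 35 residues in between.

    Only the D(35)E part of the signature is tested; the first aspartate lies
    at a variable distance upstream and is not constrained here.
    """
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(r"(?=D.{35}E)")
    return [(m.start() + 1, m.start() + 37) for m in pattern.finditer(sequence)]

# HIV-1 numbering from the text: second aspartate 116, glutamate 152.
intervening = 152 - 116 - 1

# Hypothetical toy sequence (NOT a real integrase): Asp at 5, Glu at 41.
toy = "AAAA" + "D" + "G" * 35 + "E" + "AAAA"
candidates = find_dd35e_candidates(toy)
print(intervening, candidates)
```

A pattern match of this kind only flags candidates; whether a given Asp/Glu pair is catalytic has to be decided from structural or mutational evidence.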
Although the reactions described above require only the viral and host DNA substrates and the divalent metal cofactors used by the integrase in catalysis (Mg2+ physiologically, although Mn2+ can substitute in vitro), more components are included in the preintegration complex (PIC) that is necessary for integration to take place in the nucleus [28,29]. Preintegration complexes of HIV-1 have been shown to also contain viral RT and matrix proteins, as well as a number of host proteins. One of the latter, called barrier-to-autointegration factor (BAF), appears to be crucial in preventing autointegration (integration of viral DNA into viral DNA) [30,31]. Whereas the structure of BAF complexed to DNA is known, its mode of binding to IN (if any) is not. The only cellular factor that has been shown experimentally to bind directly to IN is lens epithelium-derived growth factor (LEDGF), also known as PC4 and SFRS1 interacting protein 1 (PSIP1) or transcriptional coactivator p75 [33-36]. Structural aspects of its interactions will be discussed below. However, identification of all the proteins that participate in forming PICs, and assignment of their roles, is still not complete.
A single polypeptide chain of most retroviral INs comprises ~290 residues and consists of three clearly identifiable domains , as well as inter-domain linkers. Some important variations are, however, present. For example, PFV IN is significantly longer, comprising 392 residues, and ASV IN is encoded as a 323-amino-acid-long protein that is post-translationally processed to the final polypeptide consisting of 286 residues, which is fully enzymatically active . It must be stressed, however, that definition of the domain boundaries is to a certain extent arbitrary, due to the differences in the lengths of the linking sequences, as well as to difficulties in assignment of the residues at the borders between the domains and the linkers. As shown in Fig. 2, the N-terminal domain (NTD) of HIV-1 IN contains residues 1-46, followed by a linker consisting of residues 47-55. The catalytic core domain (CCD) contains residues 56-202, and is followed by a linking sequence 203-219. Finally, the C-terminal domain (CTD) contains residues 220-288. The residue numbers at domain boundaries for enzymes from HIV-2 and SIV are approximately the same, whereas they differ for ASV IN (Fig. 2). For PFV IN, a possibility exists that an additional domain consisting of approximately 50 residues might be present at the N terminus, preceding the NTD domain. For practical reasons, slightly different starting and ending points have been utilized for cloning of individual domains and/or two-domain constructs that were used in structural studies. The structures of representative isolated domains of IN are shown in Fig. 3.
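For readers who work with residue numbers directly, the HIV-1 IN boundaries quoted above can be written down as a small lookup table. This is a minimal sketch: the names `HIV1_IN_REGIONS` and `region_of` are hypothetical, and the boundaries are those given in the text, which, as stressed above, are to some extent arbitrary.

```python
# Inclusive residue ranges for HIV-1 IN, as quoted in the text.
HIV1_IN_REGIONS = [
    ("NTD", 1, 46),
    ("NTD-CCD linker", 47, 55),
    ("CCD", 56, 202),
    ("CCD-CTD linker", 203, 219),
    ("CTD", 220, 288),
]

def region_of(residue):
    """Return the name of the domain or linker that contains a residue number."""
    for name, first, last in HIV1_IN_REGIONS:
        if first <= residue <= last:
            return name
    raise ValueError(f"residue {residue} is outside the range 1-288")

# Catalytic triad of HIV-1 IN (Asp64, Asp116, Glu152).
for residue in (64, 116, 152):
    print(residue, region_of(residue))
```

The same table would need different boundaries for ASV or PFV IN, and constructs used in structural studies often start or end a few residues away from these limits.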
The sequence identity/similarity percentages for full-length HIV-1 IN are 58/74% in comparison with SIV IN, and 23/37% with ASV IN, respectively (Fig. 2). These numbers are not fully accurate, since they depend on the correctness of the structure-based alignment of IN from different viral sources. For individual domains, the identity/similarity percentages are for the NTD 55/76% comparing HIV-1 to SIV IN and 26/46% comparing it to ASV IN; for the CCD they are 61/77% and 27/46%, and for the CTD they are 53/68% and 14/25%, respectively. Clearly, sequence conservation is the lowest for the C-terminal domain. It should be stressed that the sequences included in Fig. 2 are shown for enzymes encoded by specific retroviral strains and that quite significant variations between different strains have been observed . In addition, crystallographic studies of some CCDs of IN or of two-domain constructs were only possible after the introduction of mutations (see below).
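The dependence of such percentages on the alignment and on the definition of similarity can be made explicit with a short calculation. The sketch below is only an illustration under stated assumptions: the residue groupings in `SIMILAR_GROUPS` are a hypothetical choice (published analyses typically rely on a substitution matrix), gap-containing columns are skipped, and the two aligned fragments are invented toy strings, not integrase sequences.

```python
# Hypothetical groups of residues counted as "similar"; a different grouping
# or a substitution matrix would give different similarity percentages.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"),
                  set("NQ"), set("ST"), set("AG")]

def identity_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # columns with a gap are ignored (one of several conventions)
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared

# Toy alignment (invented): five ungapped columns, two identical, four similar.
identity, similarity = identity_similarity("DKE-LV", "DRQALI")
print(identity, similarity)
```

Shifting a single gap in the alignment changes which residues are paired, which is why the figures quoted above depend on the correctness of the structure-based alignment.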
Until now, no reports of crystallization of the isolated NTD or CTD have appeared. The first crystals of the HIV-1 IN CCD were obtained only after an extensive mutagenesis study, which identified a mutant, F185K, with enhanced solubility. A protein with the substitution F185H, corresponding to the structurally equivalent residue present in ASV IN, was also crystallized. A further mutation, W131E, was introduced into the HIV-1 IN CCD to enhance solubility even more. The CCD of ASV IN could be crystallized without mutations, although special precautions in protein handling were necessary.
The NTD-CCD construct of HIV-1 IN was crystallized using a soluble variant of the protein with the above-mentioned mutation F185K, as well as with two additional ones, W131D and F139D. The combination of these mutations and the use of a specific buffer made it possible to increase the protein concentration up to 10 mg/ml and resulted in the growth of diffraction-quality crystals. The same three mutations were also used in crystallization of the CCD-CTD construct of HIV-1 IN, again with the aim of increasing solubility. Two additional mutations, C56S and C286S, were introduced to prevent non-specific aggregation. In contrast, the analogous two-domain construct of SIV IN included only a single mutation, F185H, implemented to improve protein solubility.
The central domain of IN (CCD) contains the complete catalytic apparatus and exhibits limited activity even in the absence of the other domains. Although CCD by itself does not carry out the joining reaction, it does support processing, albeit with decreased specificity . CCD also supports a reaction called “disintegration”, in which donor and acceptor DNA molecules are regenerated from a substrate with a Y-letter topology . Due to its importance as the core of the enzyme and because of the failure to crystallize intact INs, CCD was the first target for structural investigation of these proteins.
The structures of the isolated CCDs (Fig. 3B) have been determined in about three dozen crystallographic studies of HIV-1 IN [40,42,43,45,48-51], ASV IN [52-57], and PFV IN. In addition, seven medium-to-low resolution structures of fusion constructs with one of the terminal domains also included CCDs of HIV-2 and SIV. Since crystals of the ASV IN CCD were easier to grow, they were studied more extensively, yielding excellent structural data, such as the atomic resolution structure with the PDB code 1CXQ. The CCD has been studied in its apo form and in various metal-complexed forms, including those with the catalytically competent divalent cations Mg2+ and Mn2+. Again, ASV IN has provided a more exhaustive picture of metal coordination by the catalytic core domain, including occupation of multiple metal sites and the presence of cations such as Zn2+ that can also act as inhibitors of IN activity. Whereas six structures of small-molecule inhibitor complexes of the HIV-1 and ASV CCD have been published [43,51,56], it has not been possible to elucidate any structure of a DNA complex, although some promising crystallization results were achieved. In contrast to the situation for the peripheral IN domains, no solution structure of the CCD is available.
The CCD is built around a five-stranded mixed β-sheet flanked by α-helices (Fig. 3B). The antiparallel β1-β2-β3 hairpin-type arrangement is extended by two parallel strands β4, β5, which are part of two β-α-β crossovers, with the intervening helices α1, α3 plus a helical turn α2, all located on one side of the β-sheet. The other side of the β-sheet is covered by a long helix α4, which runs across its face. A helix-turn-helix (HTH) motif leads to a long stretch of nearly 40 residues that has a helical conformation (α5 and α6) except for a finger-like extrusion that is formed by about 12 residues (Phe185-Ala196 in HIV-1 sequence) in the middle. The finger has a peculiar conformation, extending away from the body of the enzyme (Fig. 3B). Its general conformation is similar in CCDs from different viruses, although it pivots on its points of attachment as a semi-rigid body. Despite its glycine-rich sequence, the finger is stabilized by conserved interactions, for example by a salt bridge (Arg187…Glu198 in HIV-1) anchored at the beginning of helix α6. The finger sequence of ASV CCD is the least conserved and, for example, the above salt bridge is not preserved. The amino-acid residues of the finger are hydrophilic, in accord with its solvent exposure in the isolated CCD domain, except for the very tip, which is occupied by a conserved Ile residue. (The presence of Glu203 in an equivalent location in the ASV IN sequence provides again an exception in this regard.) This unusual chemical character of the exposed tip together with lattice contacts formed by the finger loop are most likely responsible for its variations observed in different crystal structures. The C-terminal helix α6 of the CCD is truncated in PFV IN CCD and is completely absent in the construct of an isolated ASV IN CCD used for crystallographic studies [52,57]. 
However, the finger structure is clearly seen in the two-domain construct of ASV IN , where residues Lys199-Thr207 form an insert between helices α5 and α6. These observations may indicate that selection of Thr207 as the C-terminal boundary of the ASV IN CCD on the basis of extensive studies of many truncation constructs  might not correspond to a complete CCD.
The catalytic residues of the D,D(35)E sequence signature found in all INs are presented by the middle of chain β1 (D64), the loop connecting β4-α2 (the second aspartate), and by the N-terminal segment of α4 (the glutamate). They are juxtaposed in a row within a patch of negative charge on the surface of the rather flat, slab-like molecule. The active-site face of the slab is opposite to the CCD dimerization face and, therefore, the two active sites of the dimeric enzyme are far apart, nearly as far as the architecture of the dimer allows. Dimerization of the CCD involves a tandem of predominantly hydrophobic α1…α5′ interactions, plus hydrophobic contacts between helices α6 across the dimer two-fold axis, with additional hydrophilic contacts in the middle of the dimer. The latter interactions are interesting because they are connected with the formation of a hydrophilic cavity in the center of the dimer, filled by a few water molecules.
While the Cα traces of the ASV and HIV-1 CCD superpose quite well, the agreement of their dimers is less optimal and reflects a slight but evident difference in the dimer architecture. As a consequence of this difference, the two active sites of the HIV-1 IN CCD dimer are less distant (38.5 vs 42.5 Å, as measured by the separation of the catalytic Mg2+ ions). The distance between the two active sites is incommensurate with a 5-6 bp segment of double-helical B-DNA and suggests that the host DNA must be unwound for coordinated processing of the two strands, or, more likely, that two distinct IN dimers act each on only one insertion point. Until the structure of the complete IN enzyme is solved, it can only be assumed that dimerization of the core domains of the full-length proteins is not different from what has been observed for the isolated CCD domains. This assumption is supported by the consistent picture of CCD dimerization revealed by all structures of two-domain IN constructs and of complexes of IN with LEDGF [35,59].
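The mismatch between the active-site separation and the spacing of the two insertion points can be illustrated with simple arithmetic. The sketch below assumes a canonical B-DNA rise of about 3.4 Å per base pair, a textbook value that is not taken from this review, and it compares only the span along the helix axis; it ignores the helical twist and the positions of the scissile phosphates on opposite strands, so it is a rough estimate rather than a geometric model.

```python
B_DNA_RISE = 3.4  # Å per base pair; assumed textbook value, not from the review

def axial_span(n_bp, rise=B_DNA_RISE):
    """Approximate length along the helix axis covered by n_bp base pairs (Å)."""
    return n_bp * rise

# Values quoted in the text: Mg2+...Mg2+ separation in the CCD dimer (Å)
# and the stagger between the two insertion points (base pairs).
active_site_separation = {"HIV-1": 38.5, "ASV": 42.5}
stagger_bp = {"HIV-1": 5, "ASV": 6}

for virus, separation in active_site_separation.items():
    span = axial_span(stagger_bp[virus])
    print(f"{virus}: {stagger_bp[virus]} bp ~ {span:.1f} Å vs {separation} Å between sites")
```

Under these assumptions the target-site stagger spans roughly 17 to 20 Å along the helix axis, about half the distance between the two active sites of one CCD dimer.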
The CCD of HIV-1 IN used in the first structure determination (1ITG; ) contained the F185K mutation introduced to enhance solubility. The cacodylate residue from the crystallization buffer was found attached to the cysteine side chains of the protein, including Cys65 located in the active-site area . The constellation of the catalytic acids (Asp64, Asp116, and Glu152) was found to be in an “inactive”, non-native configuration (Fig. 4A). The distortion of the catalytic apparatus became apparent only later, by comparison with other, unperturbed structures, notably the ASV IN CCD [52,53]. The non-native character of the active site is manifested by the altered conformations of the two aspartic acids, including a major re-orientation of the loop carrying the Asp116 residue, and in complete disorder of the helix fragment with the Glu152 residue and the entire flexible active-site loop in front of it (in total 13 residues, 141-153). It is unlikely that the distortion of the active site was effected by the presence of the unnatural arsenic substituent, as in a related structure of arsenic-free HIV-1 IN (2ITG; ), the catalytic aspartic acids are found in exactly the same inactive conformation. Although the structure 1ITG failed to map the functional state of the protein, it provided the first chain tracing and was important in revealing the plasticity of the IN active site and its ability to adopt different conformations.
Perhaps the most significant consequence of the inactive conformation of the catalytic residues is the inability of the two aspartate side chains to bind a catalytic divalent metal cation in a coordinated fashion. Such a cation, revealed by Mg- and Mn-complexes of ASV IN [53,54] and later by Mg complexes of HIV-1 IN [48,49] and PFV IN , has an octahedral coordination sphere completed by four water molecules (Fig. 4B). The triad of the catalytic acids can remain in the active conformation even in the absence of metal cations, but then the carboxylate groups are held in place by water-mediated hydrogen-bond bridges (Asp…Wat…Asp64…Wat…Glu). However, as revealed by the atomic-resolution structures of ASV IN, and in agreement with the requirement for basic conditions for IN activity (peak endonuclease activity at pH 8.5 ), conformational changes in the active site take place at pH below 6 and consist of protonation and a concomitant swing of the Asp64 carboxylate group out of its metal-coordinating position, and into a dual-hydrogen-bond lock with a neighboring asparagine residue. In addition, changes of pH influence the flexible active-site loop, which in HIV-1 IN comprises the residues 141-147, adjacent to the glutamate-bearing N terminus of helix α4, and which in all the crystal structures shows a variable degree of disorder. The flexible active-site loop contains highly conserved residues and appears to be involved directly in substrate contacts .
There is little doubt that the metal-coordination site formed between the two aspartate side chains (site I) corresponds to a cation essential for catalysis. The perfect octahedral geometry of this site explains why mutations of the catalytic aspartates cannot be tolerated. However, increasingly larger cations can still be accommodated, from Mg2+ (mean metal…O distance 2.11 Å), to Mn2+ (2.23 Å), and even Cd2+ (2.43 Å) and Ca2+ (2.46 Å for incomplete coordination sphere). Estimation of metal-binding geometry is more reliable from the ASV IN structures, which are in excellent agreement with expected coordination stereochemistry, for instance with valence parameters  of the central ion, which for the structures listed in Table S1 are calculated as 1.95 (1VSD), 1.92 (1A5V) or 1.79 (1VSJ), the ideal target being 2.00. The corresponding values for the HIV-1 IN data indicate a high level of error, e.g. 1.23/0.91 (1BL3) or even 1.08/0.80/0.79 (1QS4), presumably as a consequence of poor data quality or structure refinement protocols. There is an important difference between ASV and HIV-1 IN in coordinating high-electron metals in site I, connected with the presence of a cysteine residue at position 65 in the latter enzyme. The thiol group of this residue is found in the coordination sphere of the cadmium cations in 1EXQ . Since no such possibility exists in ASV IN, where a phenylalanine residue immediately follows the first catalytic aspartate, high-electron metals may have different impact on the catalytic properties of integrases from these two viruses. With light metals, such as Mg2+, the thiol group of Cys65 in HIV-1 IN assumes a totally different orientation and, consequently, there is no difference in the coordination chemistry between ASV and HIV-1 IN.
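The valence check mentioned above can be reproduced in a few lines. The sketch below uses the common exponential bond-valence expression, s = exp((R0 − r)/b), summed over all metal…O contacts. The reference lengths R0 (1.693 Å for Mg2+–O, 1.790 Å for Mn2+–O) and the constant b = 0.37 Å are assumed from commonly tabulated bond-valence parameters and are not given in this review; the six equal distances are an idealization based on the mean values quoted in the text, not coordinates from any deposited structure.

```python
import math

# Assumed bond-valence parameters (commonly tabulated values); they are NOT
# given in the review and should be verified before any serious use.
R0_OXYGEN = {"Mg2+": 1.693, "Mn2+": 1.790}  # Å, cation-oxygen reference lengths
B = 0.37  # Å, empirical constant

def bond_valence_sum(cation, distances):
    """Sum exp((R0 - r) / B) over all metal...O distances r (in Å)."""
    r0 = R0_OXYGEN[cation]
    return sum(math.exp((r0 - r) / B) for r in distances)

# Idealized octahedra: six oxygen ligands at the mean distances quoted in the text.
bvs_mg = bond_valence_sum("Mg2+", [2.11] * 6)
bvs_mn = bond_valence_sum("Mn2+", [2.23] * 6)
print(f"Mg2+: {bvs_mg:.2f}  Mn2+: {bvs_mn:.2f}")
```

With these assumed parameters both idealized sites come out at roughly 1.8 to 1.9 valence units, close to the target of 2.00, whereas systematically overlong metal…O distances would drive the sum toward the much lower values reported for some of the HIV-1 IN models.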
Structural data on inhibitor complexes of IN are limited to a few structures of the CCD (Table S1). The structure of an inhibitor 5-CITEP (1-(5-chloroindol-3-yl)-3-hydroxy-3-(2H-tetrazol-5-yl)-propenone) (Fig. 5A) in complex with Mg2+-containing HIV-1 IN CCD  is the only one that includes a compound capable of binding within the active site area of the enzyme. The IC50 value of 5-CITEP, measured in a reaction that monitors 3′ processing together with DNA strand transfer, was reported as 2.1 μM. This inhibitor was observed in only one of the three independent copies of the enzyme molecule present in the crystal. The molecule of 5-CITEP is located between the coordinated Mg2+ cation and the catalytic Glu152, with which it forms hydrogen bonds (Fig. 5B). The active site of the molecule to which the inhibitor is bound is located close to the crystallographic 2-fold axis, raising the possibility that the exact mode of binding might have been influenced by crystal contacts. The inhibitor makes no direct contacts with either Asp64 or Asp116, and has only an indirect, water-mediated contact with the bound Mg2+ ion. Two symmetry-related molecules of 5-CITEP interact directly with each other. In view of these facts, it is doubtful if this structure represents the true mode of binding that would be present in an IN-DNA complex.
Another IN inhibitor, 4-acetylamino-5-hydroxynaphthalene-2,7-disulfonic acid (Y-3, Fig. 5A), was cocrystallized with the ASV IN CCD in the absence and presence of Mn2+ . This aromatic molecule with several hydrophilic substituents does not bind in the active site of the enzyme but rather on its surface, where it participates in crystallographic contacts, although there is no interference with CCD dimerization. Its presence in the crystals is, however, not a crystallographic artifact since it is observed in the same context at different pH conditions and regardless of metal coordination. Although Y-3 forms no direct interactions with the catalytic residues, it does seem to influence the conformation of the flexible active-site loop by binding to Tyr143 and Lys159 (ASV numbering). Y-3 very likely directly interferes with DNA binding by hydrogen bonding to Lys119, a residue corresponding to His114 in HIV-1 IN, shown to be capable of crosslinking to DNA. It is quite possible that these interactions are the basis of its inhibitory capacity.
The inhibitors discussed above, as well as raltegravir (Fig. 5A), the only IN inhibitor approved for clinical use, are aryl diketo acid derivatives that inhibit strand transfer much more efficiently than 3′-processing . Such compounds are characterized by the presence of α and γ C=O groups in the vicinity of a carboxylic acid moiety, although the latter group can be replaced by a triazole or tetrazole ring . No structure of raltegravir complexed with IN has been published to date, but it is expected that its mode of binding might involve direct interactions with the divalent cation(s) present in the active site.
A different class of inhibitors for which structural data are available includes arsenic derivatives that were co-crystallized with HIV-1 IN. Crystal structures have been solved for tetraphenylarsonium chloride and 3,4-dihydroxyphenyl-triphenylarsonium bromide. Both compounds bind in a similar fashion at the interface of the CCD dimer and interact directly with Gln168 of one of the two molecules. Surprisingly, the quality of the electron density maps is much better for the former compound than for the latter, although only the latter exhibits measurable inhibitory activity in the disintegration reaction (IC50 of 380 μM).
Since IN must form at least a dimer to be catalytically active, prevention of dimerization offers an interesting option for its inhibition . Several studies have reported inhibition of IN activity through the use of peptides derived from amino acid sequences responsible for the dimerization of the CCD domain [66,67], although no structural data are available. In some cases, it was possible to confirm that such peptides disrupted the association-dissociation equilibrium  or the crosslinking of the IN dimer . On the other hand, Hayouka et al.  have demonstrated that the opposite concept, namely forcing IN to form higher-order oligomers, may be a useful approach to rendering the IN inactive. Specifically, they used peptides (called “shiftides”) derived from the cellular IN-binding protein LEDGF, to inhibit the DNA-binding of IN by shifting the enzyme's oligomerization equilibrium from the active dimer toward the tetramer, which, according to their data, is incapable of catalyzing the first step of integration, i.e. the 3′-end processing.
Development of these and other classes of IN inhibitors is an ongoing process and some very potent inhibitors, with IC50 in the low nanomolar range, are now available . The process that led to the FDA approval of raltegravir, as well as clinical studies of other drug candidates, have been covered in a number of recent reviews [72-74]. In view of the paucity of available structural data on IN inhibitors, the wider subject of IN inhibitors in general cannot be adequately treated within the scope of the current review.
NMR structures of the isolated NTD domains were solved for INs from HIV-1  and HIV-2 . Multiple views of the N-terminal domain are also available in medium-resolution crystal structures of a two-domain construct of HIV-1 IN that contains the NTD and CCD domains (1K6Y; ) and of the HIV-2 NTD-CCD/LEDGF complex (3F9K; ). The solution structure of the HIV-1 IN NTD domain showed the existence of dimers consisting of two interconverting protein forms . The two forms, denoted D (1WJA) and E (1WJC), were observed together in the NMR experiment, with the D form seen mostly above ~300 K, and the E form below that temperature. A form intermediate between these two was reported for an H12C mutant of the N-terminal domain (1WJE; ).
The structure of a monomer of the NTD consists principally of four helices (Fig. 3A). Helix 1 comprises residues 2-14 in the E form and 2-8 in the D form, helix 2 residues 19-25, helix 3 residues 30-39 and helix 4 residues 41-45. The segment beyond residue 46 belongs to the interdomain linker and is disordered. A zinc cation is tetrahedrally coordinated by His12, His16, Cys40, and Cys43, although the details of the interactions with the histidine residues differ between forms D and E.
The E form of the NTD is very similar to its counterpart seen in the crystal structure of the two-domain construct (1K6Y; ), with an rms deviation of 1.05 Å between molecules A of the models. By comparison, the rms deviations between molecule A and the other three molecules seen in the crystal range from 0.28 to 0.63 Å. Form D of the NTD deviates by almost 2 Å from its crystallographic counterpart. As expected, the interactions of the Zn2+ cation with its ligands in the crystal structure correspond to the structurally closer E form.
The structure of the NTD of HIV-2 IN [78,79] is very similar to its HIV-1 counterpart. A comparison between molecule A of the first model in the assembly in 1E0E (no average structure available) and molecule A of 1K6Y shows an rms deviation of 0.86 Å, although sequence identity between the two proteins is only 55%. The details of the interactions with Zn2+ are also almost identical in the integrase NTD domains of HIV-1 (E form) and HIV-2. An rms deviation between NTDs belonging to molecules A and B in the structure of the HIV-2 IN NTD-CCD/LEDGF complex (3F9K; ) is 0.44 Å, whereas the deviation between NTD A of 3F9K and 1E0E is 1.17 Å.
The structure of the isolated CTD of HIV-1 IN (residues 220-270, the carboxyl terminus truncated) was solved independently by two groups using NMR (1IHV;  and 1QMC; [78,81]). In addition, the structures of the CCD-CTD constructs were determined by X-ray crystallography for ASV IN (1C0M, 1C1A; ), SIV IN (1C6V; ), and HIV-1 IN (1EX4; ). The structures of the CTD show the presence of dimeric molecules whose subunits were modeled as identical in 1IHV and as very similar in 1QMC (rms deviation 0.34 Å calculated for model 1, since no average structure is available). The rms deviation between these two structures is 1.2 Å. The deviations between the NMR structures of the isolated CTD and the crystallographic models of the two-domain constructs are larger, 1.65 Å between 1IHV and 1EX4 (both HIV-1 IN), 1.87 Å for 1C6V (SIV IN), and 2.05 Å for 1C0M (ASV IN). The four CTD domains present in the crystal structure of ASV IN consist of two very similar pairs (AB and CD, rmsd ~0.15 Å), whereas the rms deviation between molecules A and C is 0.77 Å.
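The rms deviations quoted for the NTD and CTD comparisons follow the usual definition: the square root of the mean squared distance between equivalent atoms after the two models have been superposed. The sketch below implements only the final step, for coordinates that are assumed to be already optimally superposed and paired one-to-one; the superposition itself, the choice of atoms (for example Cα only), and the toy coordinates are assumptions made for illustration and do not reproduce any value reported here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (Å) between two pre-superposed, paired atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates (invented): each of the three atoms is displaced by exactly 1 Å.
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0), (7.6, 1.0, 0.0)]
deviation = rmsd(model_a, model_b)
print(deviation)
```

Because the result depends on which atoms are included and how the models are superposed, rms deviations from different studies are comparable only when these choices are the same.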
A monomer of the CTD of HIV-1 IN consists of five β-strands (residues 222-229, 232-245, 248-253, 256-262, 266-270), arranged in an antiparallel manner in a β-barrel (Fig. 3C). Eighteen residues that were not included in the constructs used in the NMR experiments are also not seen in the X-ray structures of HIV-1 and SIV IN, and are presumed to be disordered. The topology of the CTD is reminiscent of SH3 domains, found in many proteins that interact with either other proteins or with nucleic acids, although no sequence similarity to SH3 proteins could be detected.
Two structures of the NTD-CCD constructs are available. A 2.4-Å resolution crystal structure of NTD-CCD of HIV-1 IN offers multiple views due to the presence of four molecules in the asymmetric unit (1K6Y; ), paired into AB and CD dimers, in which the twofold relationship between the catalytic domains resembles that of the isolated CCDs. Molecules A and D are very similar (rmsd 0.43 Å), whereas molecules B and C are more distant (rmsd 1.85 Å) mostly due to small changes in the interdomain angles. The interdomain linker region (residues 47-55) is disordered in all molecules, but the authors have postulated a pattern of domain connectivity taking into account the presence of NTD…CCD contacts (involving the tip of the finger loop of the CCD and one side of helix 20-24 in the NTD) and of NTD…NTD′ interactions in the dimer that would conserve the symmetry of the CCD…CCD′ dimer, and arguing that any other NTD-CCD connection would be incompatible with the length of the linker (Fig. 4A). In that interpretation, the distance between the end of the NTD and the beginning of CCD is about 9 Å. However, that view is contradicted by the 3.2-Å resolution crystal structure of the NTD-CCD construct of HIV-2 IN (3F9K), in which 24 IN molecules create 12 crystallographically independent dimers, each interacting with a single molecule of LEDGF . Whereas the connection between NTD and CCD is broken in the electron density map of one of the IN molecules in each assembly, it is unambiguous in the other one, forming an extended chain ~18 Å in length.
Surprisingly, careful analysis of the structure 1K6Y allows re-connection of the separated NTD and CCD domains in all four molecules in exactly the same manner as in the 3F9K structure (Fig. 6C), by the use of symmetry-related domains and of NTD-CCD linkers equivalent to the intact linker from the 3F9K structure. In this model, which differs significantly from the one originally proposed , the NTD forms a compact structure with the CCD, using the finger loop of the latter as a docking site with a number of hydrogen-bond and electrostatic points of attachment (Fig. 7). To reconcile the two models of NTD-CCD arrangement, Cherepanov and colleagues  have invoked the mechanism of 3D domain swapping. However, while this is certainly a possibility, it may be more prudent to conclude that the arrangement seen in the 3F9K structure is the only model that is currently supported by experiment. The relevance of the observed NTD-CCD interactions to the functional properties of IN is not yet clear.
The structures of two-domain constructs comprising the CCD and CTD were solved independently for HIV-1 IN at 2.8 Å resolution (1EX4; ), for SIV IN at 3.0 Å resolution (1C6V; ), and for two crystal forms of ASV IN at 2.5 Å (1C0M; ) and 3.1 Å (1C1A; ) resolution. The crystals of the HIV-1 IN contain two molecules forming a dimer, although the two-fold axis relating the CCD domains differs from the operation connecting the CTD domains. In each molecule, the two domains are connected by a long, well-defined helix comprising residues 195-222. The helix separates the CCD from CTD by as much as 30 Å (Fig. 6D).
The two crystal forms of the ASV IN contain a single dimer, or a pair of dimers. Similarly to what was observed in HIV-1 IN, the symmetry operations between the two domains of each dimer differ for the CCD and CTD. The linker between the CCD and CTD comprises residues 213-223 which assume a completely extended conformation, and not the helical form observed in HIV-1 IN. Thus the number of amino acid residues forming the linker in ASV IN is much smaller than in HIV-1 IN, although the distance between the starting and ending points of these linkers is not very different, at least for one of the two crystallographically independent molecules of ASV IN.
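The observation that a long helical linker and a much shorter extended linker span similar distances can be checked with a back-of-the-envelope estimate. The Python sketch below assumes idealized geometry: a rise of 1.5 Å per residue along the axis of an α-helix and roughly 3.4 Å per residue for a fully extended chain. These are textbook approximations, not values measured from the crystal structures, and the results are idealized end-to-end spans rather than the observed separation of the domains.

```python
HELIX_RISE = 1.5      # Å per residue along the axis of an ideal alpha-helix (assumed)
EXTENDED_RISE = 3.4   # Å per residue for a fully extended chain (approximate, assumed)

def n_residues(first, last):
    """Number of residues in an inclusive residue range."""
    return last - first + 1

def span(first, last, rise):
    """Idealized distance between the first and last residue of a segment."""
    return (n_residues(first, last) - 1) * rise

# Residue ranges as given in the text for the two CCD-CTD linkers.
hiv1_helix = span(195, 222, HELIX_RISE)        # helical linker of HIV-1 IN
asv_extended = span(213, 223, EXTENDED_RISE)   # extended linker of ASV IN

print(n_residues(195, 222), hiv1_helix)
print(n_residues(213, 223), asv_extended)
```

Under these assumptions the 28-residue helix and the 11-residue extended segment give spans within several Å of each other, which is in line with the statement that the end-to-end distances of the two linkers are not very different.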
Whereas the crystals of SIV IN also contain two dimers in the asymmetric unit, only a single CTD (denoted X) could be traced unambiguously. The chain connecting it to the CCD domain could not be traced and the authors postulated a connection with chain A of the catalytic domain . If that were the case, the two domains would form a fairly compact molecule, with multiple interdomain contacts. However, an alternative assignment of the visible CTD domain to the D chain of CCD  would create an extended two-domain molecule not unlike that of the other two enzymes, although the inter-domain angles would differ in each of the structures. In any case, a comparison of the three structures makes it clear that the arrangement of the domains shows considerable variability and may be influenced by other parts of the molecular complex.
One of the measures of the extent of interactions between the domains of IN (dimerization of identical domains, and oligomerization of different domains) is the surface area buried in their interfaces (BSA). Calculations of BSA (see footnote a) have been performed for a representative set of IN structures (Table S2). The CCD:CCD interactions extend over a fairly uniform area of about 1000-1650 Å2. This area does not depend on the presence of the linkers, at least with regard to the NTD-CCD linker (as shown by assigning the linker to either domain, or removing it altogether for the structure 3F9K). The most extensive association (largest BSA) characterizes the CCD:CCD dimer of HIV-1 IN (about 1500 Å2), and decreases in the order HIV-1>HIV-2 (~1330 Å2)>SIV (~1250 Å2)>ASV (~1080 Å2)>PFV (~1000 Å2).
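For readers unfamiliar with how a buried surface is obtained, the following Python sketch illustrates the principle on a toy system. It is not the procedure used for Table S2: the program, probe radius, and atomic radii behind those values are not stated here, so the Shrake-Rupley style point sampling, the 1.4 Å probe, the two single-sphere "partners", and the convention of reporting half of the total accessible area lost on association are all assumptions made for illustration.

```python
import math

def sphere_points(n):
    """Quasi-uniform points on a unit sphere (golden-spiral construction)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return pts

def sasa(atoms, probe=1.4, n_points=500):
    """Accessible area (Å^2) of atoms given as (x, y, z, radius), by point sampling."""
    unit = sphere_points(n_points)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # radius of the probe-expanded sphere
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A test point is buried if it lies inside any other expanded sphere.
            buried = any(
                j != i
                and (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms)
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * big_r * big_r * exposed / n_points
    return total

def buried_per_partner(part_a, part_b):
    """Half of the accessible area lost when the two partners associate (assumed convention)."""
    return 0.5 * (sasa(part_a) + sasa(part_b) - sasa(part_a + part_b))

# Toy partners: one sphere of radius 2.0 Å each, centres 5.0 Å apart (hypothetical).
partner_a = [(0.0, 0.0, 0.0, 2.0)]
partner_b = [(0.0, 0.0, 5.0, 2.0)]
print(round(buried_per_partner(partner_a, partner_b), 1))
```

The same bookkeeping shows why the treatment of a linker matters: atoms assigned to one partner contribute to its isolated accessible area, so moving a segment from one partner to the other changes the area that appears to be lost on association.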
Homodimeric interactions between the CTDs range from none to negligible (BSA at most ~450 Å2). In most structures the CTDs in the dimers of CCD-CTD constructs are far away from each other, possibly due to the influence of crystal contacts. However, even in the solution dimer of the isolated CTD, the area of interaction is very limited (~330 Å2).
The interaction of isolated NTDs in solution is slightly stronger (~510 Å2) but still rather insubstantial. In the dimer of the NTD-CCD construct with the conformation substantiated by the 3F9K structure, direct NTD:NTD interactions are, of course, absent because the NTDs fold back on their respective CCDs, and thus are completely isolated from each other. However, even in the model proposed speculatively for the HIV-1 construct, the area of direct NTD:NTD interaction is so small (260 Å2) that it can be safely neglected.
Since the NTD in a multi-domain construct folds back on the catalytic core domain, the calculation of the NTD:CCD interaction area will strongly depend on the treatment of the linker peptide (residues 47-55 in the HIV-1 IN sequence). When this sequence, which is in any case disordered in most of the structures, is completely omitted, the value of BSA is ~530 Å2. Assigning the linker to the CCD yields a slightly higher apparent buried surface (~670 Å2), but the linker certainly should not be treated as part of the NTD, since in that case the buried surface would be unreasonably large (~1050 Å2).
The interaction between the CCD and CTD domains is very limited, with BSA values falling below 400 Å2. In the published 1C6V model (SIV IN), the BSA exceeds 600 Å2, but after a more plausible re-interpretation of the assignment of the visible CTD to a CCD, the interaction area drops to ~100 Å2, thus becoming insignificant.
As can be gleaned from these calculations, the solvent-excluded buried areas of the homo- and hetero-interactions between the domains of IN are, with the exception of the CCD:CCD contact, not very extensive and their actual values are strongly dependent on the details of the structures used for their calculations, emphasizing once more the flexible nature of this enzyme. It must also be noted that despite the variation of the BSA calculated for the homodimer of the CCDs for IN originating from different viruses, the nature of the interactions is preserved. However, no similar consistency is seen for the homodimers of the other two domains, and the picture of the inter-domain interactions is even less clear.
Although a number of proteins have been implicated as putative components of the preintegration complex with integrase , the only available structural information is for complexes of the integrase binding domain (IBD) of LEDGF with CCD of HIV-1 IN , and with NTD-CCD of HIV-2 IN . The IBD used in these experiments included residues 347-442 of LEDGF. The complex of LEDGF with HIV-1 IN CCD consists of two catalytic domains of IN bound to two IBDs in a fully symmetric fashion. Each IBD interacts with segments of the two CCDs, the latter forming a typical dimer, as observed in all other structures of IN CCDs. The most extensive interactions between IBD and IN are with a segment including residues 166-171 of molecule A (a connecting peptide between helices α4 and α5, described as an unusual helix-turn-helix motif ) and bury a surface area of 319 Å2. IBD also interacts with residues belonging to helix α3 (and, to a lesser extent, helix α1) of molecule B (buried surface area 379 Å2). Due to the symmetry of the complex, the second IBD molecule interacts with the corresponding areas of molecules B and A of the CCD. The interactions of IBD with CCD in the complex with HIV-2 NTD-CCD are virtually identical, with additional, mostly electrostatic interactions provided by the N-terminal helix of NTD that belongs to molecule A (buried area 153 Å2). It is intriguing, however, that the latter complex, which was prepared by simultaneous co-expression of the interacting proteins, lacks the second IBD, even though the superposition of the common IBD in the two structures is almost exact, and the second binding site is fully formed, including the same positioning of the NTD. The importance of the structurally-derived interactions between IN and IBD was verified by extensive mutational studies of the respective interfaces . It was also reported that areas of full-length LEDGF other than IBD may be involved in interactions with IN [83,84].
The oligomerization state of IN in vivo is still not known, but extensive in vitro work has shed light on this matter. The isolated NTD, CCD, and CTD domains all remain in solution as dimers, a conclusion uniformly supported by solution chemistry and structural biology studies . However, experiments that found IN…DNA interaction sites by photocrosslinking also suggested that IN acts as an octamer . Comparison of simulations with time-resolved fluorescence anisotropy measurements of rotational correlation times could distinguish monomers, dimers, and tetramers, while octamers could not be resolved from higher-order species . At micromolar concentration IN exists as tetramers, octamers, and higher-order aggregates, but such a concentration is much higher than the cellular one. At catalytic (submicromolar) concentration, these experiments showed that IN could exist as a monomer, while addition of Zn2+ stimulated dimer formation. However, the authors noted that the standard buffer conditions include detergents, which dissociate IN oligomers . Solution small-angle X-ray scattering (SAXS) data for complexes of IN with oligonucleotides also indicated primarily monomeric species , with the same caveat regarding possible effects of detergents on the oligomerization state. If detergent is eliminated from the purification and assay experiments, IN exhibits different assembly and catalytic properties.
It must be pointed out that all of the experiments mentioned above used indirect measurements of the size of IN oligomers. More direct observations involving atomic force microscopy of intact IN complexed to a DNA substrate have shown visually that the size of these complexes is consistent with a tetramer of IN molecules . Similar results were obtained by electron microscopy and single-particle image reconstruction that yielded coarse three-dimensional models at ~27-Å resolution . The finding of a tetramer as the predominant species agrees with several IN/DNA models, with analysis of IN isolated from nuclear extracts and its complex with LEDGF , and with dynamic light scattering experiments.
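The reason rotational correlation times report on oligomeric state can be illustrated with the Stokes-Einstein-Debye relation for a rigid sphere, in which the correlation time is proportional to the hydrated volume and hence to the mass. The Python sketch below is only such an illustration and is not the simulation analysis used in the cited work. The monomer mass of about 32 kDa, the partial specific volume (0.73 cm3/g), the hydration (0.3 g of water per g of protein), the viscosity of water, and the temperature are all assumed typical values.

```python
K_B = 1.380649e-23    # Boltzmann constant, J/K
N_A = 6.02214076e23   # Avogadro constant, 1/mol

def rotational_correlation_time_ns(mass_da, temp_k=293.15, viscosity_pa_s=1.0e-3,
                                   partial_specific_volume=0.73, hydration=0.3):
    """Stokes-Einstein-Debye estimate for a rigid hydrated sphere, in nanoseconds."""
    # Hydrated volume per molecule: (cm^3/g) * (g/mol) / (1/mol) gives cm^3; 1e-6 converts to m^3.
    volume_m3 = mass_da * (partial_specific_volume + hydration) / N_A * 1.0e-6
    return viscosity_pa_s * volume_m3 / (K_B * temp_k) * 1.0e9

MONOMER_DA = 32000.0  # approximate mass of one HIV-1 IN chain (assumed for illustration)

for name, n in (("monomer", 1), ("dimer", 2), ("tetramer", 4), ("octamer", 8)):
    print(name, round(rotational_correlation_time_ns(n * MONOMER_DA), 1), "ns")
```

In this approximation the correlation time simply doubles with each doubling of mass. Real oligomers are neither spherical nor rigid, so the estimate indicates expected orders of magnitude rather than measured values.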
The assembly of HIV-1 IN into oligomers is different when in complex with Mn2+ vs Mg2+ under various in vitro conditions . These experiments did not clarify which cation is preferred, but they did show that HIV-1 IN had no active-site cation preference when already in complex with a structural (non-catalytic) Zn2+ cation. The authors concluded that binding of the catalytic cation and DNA requires a pre-existing specific IN conformation.
A number of models of IN/DNA complexes were proposed, involving either just the CCD [91-95], or the full-length enzymes [96-98]. As no structure of an intact IN molecule has been reported to date, the two-domain IN constructs, namely NTD-CCD and CCD-CTD, are being used as starting points for building models of the complete HIV-1 IN protein and IN/DNA complexes . These structures will be informative since they complement each other, and physically fit well together. However, it must be stressed that the IN domains are connected by flexible linkers allowing significant inter-domain variability, and a three-domain model may not reflect the actual conformation(s) of the intact protein alone or in complex with DNA (Fig. 8).
The starting point for modeling the interactions between full-length IN and DNA was usually based upon experimental structures of recombinases (which bind DNA molecules forming Holliday junctions) . The structure of Tn5 transposase as a synaptic complex transition state intermediate came as a breakthrough for integrase modelers . The prokaryotic Tn5 transposase performs a series of catalytic steps, with distinct processing (endonucleolytic cleavage) and joining reactions, which are very similar to those catalyzed by retroviral IN. Also, its catalytic core domain is structurally very similar to those of retroviral INs. Tn5 functions as a dimer and its DNA binding sites provide a clear template for modeling IN…DNA interactions. These models can be used to predict IN amino acid residues important for DNA binding that can be subsequently tested experimentally.
DNA crosslinking studies implicate certain positively charged or hydrophobic residues in IN…DNA interactions. Such residues identified in HIV-1 IN include His114, Tyr143, and Lys159 . The DNA-binding CTD contains less well-conserved residues which have been identified as important for DNA binding, viz. HIV-1 Glu246, Lys258, Pro261, Arg262, Lys264, with some weaker involvement of Ser230 and Arg231 . The somewhat lower degree of sequence conservation in this region may reflect differences in specificity. Finally, the CTD also plays an important role in IN dimerization. Mutation of Leu241 and Leu242, located along the C-terminal dimer interface, to alanine disrupts IN dimerization and strongly reduces catalysis . A comparison of mutants of HIV-1 and ASV INs identified a number of residues (Gln44, Leu68, Glu69, Val72, Ser153, Lys160, Ile161, Gly163, Gln164, Val165, His171, Leu172, Asp229, Ser230, and Asp253) as responsible for the specificity of binding one of the DNA long terminal repeat ends to IN [97,98]. Further experimental and computational work is needed in order to improve the existing models of the structure of IN and of the interactions with the DNA substrates.
The CCD of IN is responsible for the two enzymatic activities of the enzyme, processing and joining. Both reactions are chemically similar, proceeding via a nucleophilic attack on a phosphorus atom in the DNA backbone by a donor hydroxyl group (water or the newly formed 3′ OH), activated by the catalytic center of the enzyme. In vitro, these reactions require Mg2+ or Mn2+ cations, the latter being more efficient. However, because of physiological abundance, Mg2+ is assumed to be the cofactor in vivo. Whereas the isolated CCD may exhibit some basal enzymatic activities, the full-length enzyme is necessary for joining to proceed.
The nature and number of divalent metal cations required for catalysis are still under debate. The general composition of the IN active site (a constellation of acid groups) and the similarity of the catalyzed reactions to those carried out by other nucleotidyl transferases would strongly indicate the two-metal-cation mechanism elaborated by Steitz & Steitz . However, despite numerous attempts, it has never been possible to obtain an IN/Mg2+ or Mn2+ complex with two metal cations in the active site (i.e. only site I is filled with Mg2+ or Mn2+, while site II is empty). On the other hand, it was possible to introduce two cations into the active site by using physiologically irrelevant but stronger binding metals, such as Zn2+, Cd2+, or Ca2+ with ASV IN , and Cd2+ with HIV-1 IN . The case of zinc coordination is of special note because, firstly, Zn2+ accepts only four, tetrahedrally arranged ligands, which are a subset of the octahedral sphere of the other cations; secondly, although it is not a cofactor of IN catalysis in vivo, it can support endonucleolytic activity in in vitro assays; thirdly, it severely impairs polynucleotidyl transferase activities of IN in vitro; and, fourthly, its potential interaction with the CCD is complicated by the fact that it is the major physiological cofactor of the NTD.
The most instructive case is Cd2+ coordination by IN. One has to clearly distinguish the cases of HIV-1 IN and ASV IN, because the above-mentioned Cys65 residue of HIV-1 IN actually functions as a bridge coordinating both metal centers (sites I and II) simultaneously, replacing in this role the catalytic D64, which is forced away from its active conformation . In this light, the structure of the ASV IN CCD/Cd2+ complex  is more illuminating for providing insights about the possible two-metal-cation functional state of the enzyme.
There is striking similarity between the Cd2+-complexed active site of ASV IN and those of other nucleotidyl transferases, most notably of RNase H, which has been described in a ternary complex with Mg2+ (at sites A and B) and an RNA/DNA hybrid , as well as Tn5 transposase  (Fig. 6). First, the metal…metal distance is nearly identical in ASV IN and RNase H, and compatible with what Yang et al.  predict to be required for effective nucleotide-bond hydrolysis (4.0 Å). Additionally, in both cases the two metal centers are connected by two bridging ligands, one of them being a conserved aspartate from the catalytic apparatus (D64 in HIV-1 IN). The other bridge is provided by a water molecule in the ASV IN/Cd2+ (and also Zn2+) complex, but in the RNase H complex structure this water is displaced and its role is assumed by an O atom of the scissile phosphate group of the RNA substrate. This phosphate group is even more essential for the integrity of the functional active site of RNase H because it also fills (albeit with less ideal stereochemistry) an additional site in the coordination sphere of site B of RNase H. (Moreover, the next phosphate of the RNA substrate participates in the activation of the water nucleophile presented in the coordination sphere of site A.) Overall, site B has much less regular stereochemistry, in contrast to the nearly perfect geometry of site A. While the simplicity and approximate mirror symmetry of the two-metal-cation active sites would allow two alternative mappings of the metal centers between RNase H and retroviral IN, there is little doubt that the correct mapping is A-I and B-II. This is because, like site A in the RNase H structure, site I of the IN CCD has a nearly perfect coordination sphere, while site II is far less regular, with a missing ligand, large scatter of the Cd2+…O distances, and large angular distortions.
If this analogy between the active sites of IN and RNase H is correct, then the catalytic metal cation at site I of IN would participate in activating a nucleophilic group (e.g. a water molecule) for attack on a substrate DNA phosphate group. Metal II, on the other hand, would play a role in destabilizing the enzyme-substrate complex, i.e. in driving the reaction forward. At the completion of a reaction cycle, one or both metal cations would probably dissociate as their effective binding (especially at site II) critically depends on the presence of substrate DNA. The parallel between RNase H and retroviral IN also has a chemical aspect because coordination of two Mg2+ cations by RNase H was easy and occurred at low metal ion concentration only in the presence of the RNA/DNA substrate. With the enzyme alone, the effective Mg2+ concentration had to be much higher, at non-physiological levels . With ASV IN, it was not possible to introduce a catalytic metal cation at site II, despite a thorough experimental survey, in which elevated metal concentrations were used . This difficulty is correlated with the flexibility of the glutamate element of the active site which participates in the formation of site II. It may be necessary for the enzyme to use external means, such as substrate assistance, to sequester an Mg2+ cation in site II, with subsequent or simultaneous stabilization of the glutamate side chain.
After an initial spurt of activity in the years 1994-2001 that resulted in a wealth of crystal and NMR structures of retroviral INs, only a few new structures have been published in the last eight years. Since many questions, particularly those regarding the structure of the full-length active enzyme and the multiprotein-DNA preintegration complexes, remain to be answered, further structural and biochemical work on this enzyme still needs to be pursued. In addition, IN continues to be an important target for designing anti-HIV drugs, which makes continuation of studies of its structure and function ever more important.
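The criteria used above to tell a regular site from an irregular one (number of ligands, scatter of metal…O distances, angular distortion) can be put into numbers. The Python sketch below shows one possible way of doing so and is not the analysis performed on the actual structures. The function names, the choice of descriptors, and both sets of coordinates are hypothetical: the "regular" site is an ideal octahedron and the "irregular" one is an invented five-ligand arrangement.

```python
import math
from itertools import combinations

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle_deg(center, p, q):
    """Angle p-center-q in degrees."""
    v = [a - c for a, c in zip(p, center)]
    w = [a - c for a, c in zip(q, center)]
    cosine = sum(a * b for a, b in zip(v, w)) / (distance(p, center) * distance(q, center))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def site_regularity(metal, ligands):
    """Return (mean metal-ligand distance, distance spread, rms angular deviation)."""
    dists = [distance(metal, lig) for lig in ligands]
    mean = sum(dists) / len(dists)
    spread = max(dists) - min(dists)
    # In an ideal octahedron every ligand-metal-ligand angle is 90 (cis) or 180 (trans) degrees,
    # so each angle is compared with whichever of the two ideal values is closer.
    devs = []
    for p, q in combinations(ligands, 2):
        a = angle_deg(metal, p, q)
        devs.append(min(abs(a - 90.0), abs(a - 180.0)))
    rms_dev = math.sqrt(sum(d * d for d in devs) / len(devs))
    return mean, spread, rms_dev

metal = (0.0, 0.0, 0.0)
# Hypothetical regular site: six ligands on an ideal octahedron, 2.3 Å from the metal.
regular_site = [(2.3, 0, 0), (-2.3, 0, 0), (0, 2.3, 0), (0, -2.3, 0), (0, 0, 2.3), (0, 0, -2.3)]
# Hypothetical irregular site: one ligand missing, the others displaced.
irregular_site = [(2.2, 0.3, 0), (-2.5, 0, 0.4), (0, 2.3, 0), (0.3, -2.6, 0), (0, 0, 2.4)]

for name, site in (("regular", regular_site), ("irregular", irregular_site)):
    mean, spread, rms_dev = site_regularity(metal, site)
    print(name, len(site), round(mean, 2), round(spread, 2), round(rms_dev, 1))
```

The same `distance` helper applied to the two metal positions would give the metal…metal separation that is compared with the 4.0 Å value mentioned in the text.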
We would like to thank Drs. Alla Gustchina, Anna Marie Skalka, and Mark Andrake for fruitful discussions, and Dr. Peter Cherepanov for providing us with a manuscript and data prior to their publication. This project was supported in part by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. The research of MJ was supported by a Faculty Scholar fellowship from the National Cancer Institute.
Footnote a: In all buried surface calculations in this article, the reported surface refers to one interacting protein partner, unless stated otherwise.