|Home | About | Journals | Submit | Contact Us | Français|
Human immunodeficiency virus type 1 (HIV-1) integrase inhibitors are in clinical trials, and raltegravir and elvitegravir are likely to be the first licensed drugs of this novel class of HIV antivirals. Understanding resistance to these inhibitors is important to maximize their efficacy. It has been shown that natural variation and covariation provide valuable insights into the development of resistance for established HIV inhibitors. Therefore, we have undertaken a study to fully characterize natural polymorphisms and amino acid covariation within an inhibitor-naïve sequence set spanning all defined HIV-1 subtypes. Inter- and intrasubtype variation was greatest in a 50-amino-acid segment of HIV-1 integrase incorporating the catalytic aspartic acid codon 116, suggesting that polymorphisms affect inhibitor binding and pathways to resistance. The critical mutations that determine the resistance pathways to raltegravir and elvitegravir (N155H, Q148K/R/H, and E92Q) were either rare or absent from the 1,165-sequence data set. However, 25 out of 41 mutations associated with integrase inhibitor resistance were present. These mutations were not subtype associated and were more prevalent in the subtypes that had been sampled frequently within the database. A novel modification of the Jaccard index was used to analyze amino acid covariation within HIV-1 integrase. A network of 10 covarying resistance-associated mutations was elucidated, along with a further 15 previously undescribed mutations that covaried with at least two of the resistance positions. The validation of covariation as a predictive tool will be dependent on monitoring the evolution of HIV-1 integrase under drug selection pressure.
Antiviral therapy has been highly successful in managing human immunodeficiency virus type 1 (HIV-1) infections in individuals who have access to treatment. Current HIV-1 inhibitors target the reverse transcriptase and protease enzymes, as well as the fusion of virus to host cells. To date, there are 22 inhibitors of HIV-1 licensed for use within the clinical setting (30); however, the high mutation rate of HIV-1 (27) drives extensive natural polymorphism and provides a mechanism that allows resistance to develop to all inhibitors (25, 32). As a consequence of HIV-1 drug resistance, there is an ongoing need to generate new inhibitors of HIV both to existing and novel targets.
One such target, HIV-1 integrase, mediates the integration of the reverse-transcribed HIV-1 DNA into the host cell genome and has long been an attractive target for inhibition, as this process is critical within the retroviral life cycle (17). There is the potential to inhibit a number of steps within the HIV-1 life cycle that involve integrase (18); however, the most intensely studied compounds inhibit the strand transfer of the proviral DNA into the host genome (strand transfer inhibitors [STIs]) (12).
HIV-1 integrase comprises a three-domain monomer (5, 6) and a multimeric quaternary structure. The N-terminal domain (codons 1 to 50) is alpha-helical, binds zinc, and is important for protein stability. The peptide chain that links this domain to the catalytic core may be structurally disordered and has not been resolved by X-ray crystallography. The catalytic core domain (codons 51 to 211) forms a two-layer alpha-beta sandwich and contains the active-site triad of residues (Asp 64, Asp 112, and Glu 152) that coordinates magnesium ion binding and the catalysis of the strand transfer process (16). In silico studies have suggested that STIs prevent integrase activity by binding to the active site via the magnesium ion (29). The C-terminal domain (codons 212 to 288) forms an all-beta-sheet SH3 domain that binds DNA in a nonspecific manner (21). The quaternary structure of HIV-1 integrase has not been resolved; however, crystallography and modeling studies have proposed that integrase forms a tetrameric structure (33). This tetrameric model contains two exposed active-site triads and allows for interactions between the three monomeric domains on different chains.
The development of inhibitors that are active against integrase has been slow due to a lack of effective screening methodologies and suitable lead compounds with low toxicity. Nevertheless, two compounds are in advanced stages of clinical evaluation (raltegravir [MK-0518], phase III, and elvitegravir [GS-9317], phase II). As with all other classes of HIV-1 inhibitors, determining and predicting resistance to these compounds will be critical to their effective use in the clinical setting (3, 9). Data from clinical trials and the in vitro passage of HIV-1 in the presence of both lead inhibitors and related compounds has generated an extensive list of mutations that modulate the sensitivity of HIV-1 to these compounds (Table (Table1).1). Resistance data from clinical trials has shown that there are two distinct resistance pathways to raltegravir, which are determined by the presence of either the N155S mutation or the Q148K/R/H mutation (13). Depending on which of these mutations develops initially, the accessory mutations that confer high-level resistance will differ. Both of these resistance pathways also confer high-level resistance to elvitegravir (22), even though clinical trial data and in vitro analysis show that alternative mutations also can cause resistance to this inhibitor.
Because the approval of HIV-1 integrase inhibitors for clinical usage is imminent, we sought to examine the natural polymorphisms within a set of 1,250 integrase inhibitor-naïve sequences (20). We have assessed the extent to which the frequency of integrase inhibitor-associated mutations impact upon the antiviral efficacy and emergence of resistance for this class of compounds. In order to identify mutations potentially associated with resistance, we also have assessed amino acid covariation within HIV-1 integrase sequences.
A total of 1,250 full-length HIV-1 genomes were retrieved from the HIV sequence database (20). The HIV-1 protease, reverse transcriptase, and integrase coding regions were extracted from the full-length genome sequences using a local alignment algorithm (Matcher-Emboss ) and aligned using ClustalW (31). Protease and reverse transcriptase sequences were used to verify the subtype annotation (using STAR ) provided by the database, and integrase sequences were used for subsequent analysis. The 1,250 integrase coding sequences subsequently were filtered for redundancy at 100% sequence identity, resulting in a nonredundant sequence set of 1,165. This sequence set contained 274 ambiguous nucleotides (0.24 per sequence) and 125 ambiguous amino acids (0.11 per sequence). Ambiguous amino acids were assigned (letter X) only when multiple amino acids were possible at the same codon, and these positions were considered mutations from the wild-type sequence. Ambiguous nucleotides were included in the subtype diversity analysis, but ambiguous amino acids were excluded from the analysis of the frequency of amino acids associated with resistance to integrase inhibitors.
A multiple alignment of 1,165 HIV-1 integrase sequences (amino acids 1 to 288) was generated, and this alignment was subdivided into smaller alignments representing the defined HIV-1 subtypes and groups (28) (number of sequences per subtype: subtype A, 165; CRF_01, 58; CRF_02, 43; B, 189; C, 391; D, 41; F, 10; G, 14; H, 3; J, 2; K, 1; N, 5; and O, 21). Sequences were assigned to a subtype alignment in which the Los Alamos database subtype assignation (based on the 9-kb HIV genome) agreed with the STAR subtype designation. A total of 222 sequences were not assigned to the subtypes or groups listed above, usually because the subtype predicted along the length of the genome was recombinant. In many cases, the published recombinant architecture and STAR subtype prediction both indicated that a sequence was not recombinant within the integrase coding region. However, it was decided that a conservative assignment of subtype would minimize errors during downstream analysis.
Subtype-specific multiple alignments were converted into simple position-specific scoring matrices (PSSMs), where the frequency of an amino acid at each position within the alignment was scored. Relative subtype diversity was calculated for each subtype relative to the subtype B PSSM using the L1 norm metric. The amino acid residue frequency at each position within the subtype B PSSM was subtracted from the frequency of the equivalent amino acid and position for each of the other HIV-1 subtypes. The absolute values from this subtraction then were summed. Thus, for instances of a position within integrase where the subtype B PSSM and a non-subtype B PSSM contained differing amino acids with no variation at that position, the score returned was maximal (2). Division by two of the summed absolute values and conversion to a percentage represented a measure of the diversity between a position within two subtype PSSMs. These diversity measures were averaged over a sliding window of 20 amino acids with a step interval of 10.
Data from in vitro and clinical resistance studies on integrase inhibitors were collated from the scientific literature (Table (Table1)1) (1, 4, 7, 8, 11, 13, 19, 22). The frequency of resistance-associated mutations was calculated for each subtype using the subtype PSSMs described above.
Covariation analysis was performed on three independent data sets comprised of subtype A (n = 165), subtype B (n = 189), and subtype C (n = 391). Amino acid positions containing greater than 2% variation from subtype consensus sequences were included in the covariation analysis. Covariation analysis was performed using the Jaccard index (J) (equation 1), where NXY is the number of sequences containing mutations at amino acid positions X and Y and NX0 and N0Y are the number of sequences with a mutation at position X or Y alone. The Jaccard index for each mutational pair was converted into a Z score by a comparison of observed data to a random model (equation 2) (26). The random model (RJ) was calculated by simulating 10,000 sequences using the distribution of amino acids observed at each position within the subtype alignments. For each simulated sequence, amino acids at each position were drawn from the multinomial distribution according to the frequencies at which each amino acid was found in the data sets. The standard deviation of the Jaccard index was calculated for each pair of covarying amino acids using a leave-one-out analysis of sequences containing that amino acid pair.
To compensate for the phylogenetic relatedness of sequences containing covarying amino acid positions, the Z score was weighted using a sequence distance measure (D) (equation 3). The average pairwise distance between sequences containing a pair of mutations (NXY), calculated using a BLOSUM62 matrix, was used as a measure of phylogenetic relatedness. This value subsequently was scaled between 0 and 1 using the maximal pairwise distance [max(data set)] between all sequences in a data set (subtype A, B, or C).
The distance scaling factor (D) was used to modify the Jaccard index and the calculation of Z scores (ZD) for an observed covarying amino acid pair (equation 4). The distance scaling factor used to modify the Jaccard index of the random model (RD) was estimated using the average D of sequences containing covarying amino acids within 10,000 random sequences, generated 30 times.
P values were estimated from the Z scores and subsequently corrected for multiple testing using a false discovery rate (FDR) (14) set to P = 0.001.
Amino acid variation within HIV-1 integrase was consistent with that observed within the whole of the HIV-1 pol gene. Group M subtypes contained approximately 5% average intrasubtype variation and 7 to 8% average intersubtype variation. Group M varied by approximately 11% from group N and 18% from group O. Integrase amino acid variation was not homogenous along the length of the coding region and occurred within each of the three domains within HIV-1 integrase (Fig. (Fig.1).1). Subtypes CRF_01, K, and H showed the greatest diversity from the subtype B PSSM within the N-terminal domain (codons 1 to 50). Subtypes CRF_02, F, G, K, and H varied relative to subtype B, in both the alpha-helical linker and the small DNA binding motif within the C-terminal domain (codons 212 to 288). The greatest inter- and intrasubtype variation was found within the catalytic domain of integrase (codons 51 to 211) and specifically was located between amino acids 90 and 150. Subtypes CRF_01, CRF_02, and A diverged from subtype B by as much as 40% around amino acid position 130. The variable region of the catalytic domain encompassed one of the catalytic amino acids (Asp 116) that is required for enzyme activity and targeted by STIs. Variation around the active and inhibitor binding site suggests that the efficacy of STIs or the genetic barrier to resistance is influenced by virus subtype.
An analysis of the frequency of the occurrence of polymorphisms associated with resistance to HIV-1 integrase inhibitors (Table (Table1)1) showed that mutations V72I and V201I were extremely prevalent across all subtypes and occur in 80 and 46% of these 1,165 sequences, respectively (Fig. (Fig.2a).2a). Other mutations, such as V74I/M, T97A, M154I, V165I, and T206S, occur with a frequency of greater than 12% in certain subtypes. Overall, 25 out of 41 integrase resistance mutations were detected within 1,165 inhibitor-naïve sequences spanning all HIV-1 subtypes. The resistance-associated mutations that occurred as natural polymorphisms within the Los Alamos database tended to occur within the most populated subtypes (CRF_01, CRF_02, A, B, C, D, F, and G). It was impossible to say whether the observed relative absence of resistance mutations in subtypes H, J, K, and N was due to the small numbers of sequences from these subtypes, such that the genetic diversity of these subtypes was not represented, or because of some inherent biological constraint. All of the inhibitors and pathways to resistance described previously include mutations that were present within a broad-spectrum inhibitor-naïve population (Fig. (Fig.2b).2b). However, the key mutations ascribed to resistance (where these are known) were rare; thus, key mutations specific for resistance to raltegravir were found only in the subtype B sequence set, which contained one N155H and three G140S/A polymorphisms. Similarly, the key mutations specific for resistance to elvitegravir were either not detected or detected at very low frequency: H51Y, one subtype A sequence; S147G, three subtype CRF_01 and C sequences; and E157Q, eight subtype CRF_02, A, B, C, and D sequences.
In order to utilize HIV-1 sequence data prospectively rather than purely retrospectively, amino acid covariation analysis was performed to determine whether resistance-associated mutations covaried within inhibitor-naïve sequence sets. Covariation analysis has been used extensively to determine sequence-based interactions between HIV-1 drug resistance mutations in the protease and reverse transcriptase enzymes (2, 14, 34) and to infer mechanistic interactions between amino acids in the V3 loop region of gp120 (15). The absence of drug selection pressure on the integrase sequences used within this study conferred an established problem with the determination of amino acid covariation (10); namely, that covariation could be observed as a function of mechanistic interactions between specific amino acids or as a function of the evolutionary relationships between sequences. Without drug selection pressure, some of the observed polymorphisms may confer a selective advantage, but it is unclear what generates that selection pressure. It is conceivable that there are natural inhibitors of integrase or that the inhibition of HIV by inhibitors targeted to other enzymes indirectly alter the selection pressure applied to the integrase coding region. We hypothesize that pairs of mutations in sequences that cluster within a phylogeny are more likely to be mutations that have been fixed within specific clusters by chance. Conversely, pairs of mutations that are present in sequences distributed across a phylogeny may represent multiple mutation and selection events at these positions. In the latter case, amino acid covariation is more likely to represent functional interactions between these residues rather than evolutionary patterns. For this reason, the detection of covariation was weighted toward sequences that did not cluster within a phylogeny.
Covarying mutations at codons 72 and 201 were present in 27 subtype B sequences, and those sequences had an average overall distance (D) of 0.39 (Fig. (Fig.3a).3a). Mutations at codons 154 and 201 co-occurred in 23 sequences, with an overall distance of 0.15, which was lower because all except three of the sequences containing these mutations were present within a single cluster (Fig. (Fig.3b).3b). The distribution of the 154 and 201 mutations within a sample of the random sequence phylogeny (1,000 sequences from a total random population of 10,000) did not show any clustering (Fig. (Fig.3c).3c). To improve the stability of the prediction model, the distance between sequences containing covarying amino acids within the random model (RD; see Materials and Methods) was estimated using 10 iterations of 10,000 random sequences per subtype. On the basis of this analysis, the RD was fixed at 0.43 for all covarying pairs of amino acids using sequences from subtypes A, B, and C (average RD for subtype A, 0.44 [standard deviation, 0.034]; average RD for subtype B, 0.43 [standard deviation, 0.041]; and average RD for subtype C, 0.42 [standard deviation, 0.042]).
As the distance between sequences within the data model (D) usually was lower than the random model distance (RD), Z scores calculated with the incorporation of the distance model (ZD) were lower than those where this model was not used (Z) (Fig. (Fig.4a).4a). The incorporation of the distance correction factor resulted in the exclusion of 75.3% of the amino acid pairs predicted as covarying significantly using the FDR and a P value of less than 0.001. A comparison between the weighted (ZD) and unweighted (Z) scores suggested that the weighted score was more sensitive to convergence between sequences and that HIV-1 integrase showed evidence of convergence. Simulations of six sets of 1,000 sequences, constrained such that the distance scaling factor D was fixed within the ranges of 0.1 to 0.3 (most similar, least convergent) to 0.6 to 0.8 (most dissimilar, more convergent), with each bin range increasing by 0.1, were generated, and the relationships between ZD and Z were plotted. A binomial curve was fitted to each of the scatterplot series, with correlation values ranging between 0.51 (D, 0.1 to 0.3) and 0.84 (D, 0.3 to 0.5). The lowest correlation coefficients were obtained where D was constrained at the highest and lowest ends of the constraint scale. The data from the simulated data was limited to the range of Z and ZD values observed using the subtype B integrase sequences, and these data and a fitted curve also were presented (r2 = 0.63) (Fig. (Fig.4b).4b). As expected, the analysis of weighted and unweighted Z scores, where the distance scaling factor was constrained to a score similar to that used within the random model (D, 0.43), produced a linear relationship. Covariation between sequences constrained to have a high distance scaling factor, indicative of convergence, produced shallow curves that reached a Z score plateau with an increasing ZD score. While the line generated by observed subtype B data is elevated from the highly convergent simulated sequences, the shape of the curve suggests that the observed data shows a level of sequence convergence at high ZD scores. The minimum ZD scores required to give a significant estimated FDR-corrected probability across a range of P values were calculated for each of the six simulated data sets (Fig. (Fig.4c).4c). The data set for which D was constrained between 0.1 and 0.3 resulted in the highest ZD scores across the range of P values, and the data set for which D was constrained between 0.6 and 0.8 required the lowest ZD scores to achieve significance. The P value cutoff used to analyze the HIV-1 subtype data sets was 0.001. In all cases, the minimum ZD score used to assign significance to an instance of covariation in the observed data sets was higher than the minimum ZD scores generated by the simulations.
The detection of covarying amino acid pairs within HIV-1 integrase was performed independently on sequences from subtypes A (n = 165), B (n = 189), and C (n = 391). Collectively, 108 amino acid positions were detected as covarying; however, only 16 positions (14.8%) covaried in all three subtypes. There was extensive overlap in covariation between subtypes A/B and A/C (20 and 12 positions, respectively), while subtypes B and C shared only two positions that covaried. In addition, each subtype contained a number of unique covarying amino acid positions (A, 24; B, 31; and C, 3). Position 201 exhibited the most covarying partners, though many other positions were detected frequently. Thirteen out of the 29 resistance-associated amino acid positions (Table (Table1)1) were shown to covary, with positions 72, 74, 125, 156, 167, and 206 detected most frequently. Ten of the 13 resistance-associated amino acid positions that were detected as covarying could be linked to form an ad hoc network of positions that varied together (Fig. (Fig.5).5). This ad hoc network was further expanded to include amino acid positions that linked multiple resistance-associated nodes. The addition of amino acid positions that had not previously been implicated in HIV integrase resistance enabled the inclusion of position 203 within the network.
Subtype-dependent sequence variation was observed in all three structural domains of HIV-1 integrase. The alpha-helical N-terminal domain showed the greatest variation on the side of the bundle that interacts with the N-terminal domain from an adjacent integrase chain rather than the side that interacts with the catalytic core (Fig. (Fig.6).6). The C-terminal SH3-like domain (21) was more conserved than the extended alpha-helical strand linking it to the catalytic domain. The catalytic domain was highly variable between amino acids 110 and 145, with additional variation in subtype J between positions 90 and 110. Structurally, this region encompasses a strand of alpha-helix and a poorly resolved or disordered fragment that leads into the active-site residue at glutamine 152. Sequence variation within HIV-1 integrase may be constrained by the maintenance of both the active site and its macromolecular structure. It has been proposed that integrase forms a dimer structure that in turn dimerizes to form a tetramer comprised of four chains of HIV-1 integrase (Fig. (Fig.6)6) (33). This tetrameric structure allows direct interactions between N-terminal, C-terminal, and catalytic domains on different chains. The conserved regions of the catalytic domain form an alpha-helix loop alpha-helix (residues 145 to 185) between the N-terminal domain and another conserved antiparallel beta-sheet region (residues 50 to 90). These regions crucially contain two of the active-site amino acids that coordinate magnesium ion binding and many of the integrase inhibitor resistance mutations. Changes in these regions consequently may affect both the activity of the enzyme and interactions with the zinc ion binding N-terminal domain.
Notably, the major mutations associated with pathways to resistance for either raltegravir or elvitegravir (N155H, Q148K/R/H, and E92Q [Table [Table1])1]) were not present in the inhibitor-naïve sequence population. However, many of the other mutations associated with these resistance pathways were present, suggesting that the rate at which resistance develops is affected by the baseline genotype. There was no obvious relationship between the subtype and the frequency of occurrence of polymorphisms, other than a slight increase for the more frequently sampled subtypes.
An analysis of pairwise amino acid covariation was performed in an attempt to use sequence analysis prospectively to determine whether resistance-associated polymorphisms covaried and whether there were any previously unidentified positions that covaried with these resistance positions. In order to prevent bias toward the detection of phylogenetically related amino acid pairs, a novel modification of the Jaccard index (26) was implemented. The software generated to perform the covariation analysis takes a multiple-sequence alignment and an optional mutation list as its input and therefore is applicable to a wide range of HIV or other protein sequence data sets. The program was written in Perl and is available on request from the authors.
The covariation analysis performed in this work is a prediction; therefore, in principal, it would have been beneficial to perform a receiver operating characteristic (ROC) curve analysis of the sensitivity and specificity of the modified Jaccard index algorithm. While simulating sequences that encompass a range of Jaccard indices and variable distance values is trivial, defining a set of true-positive and true-negative covarying pairs of amino acids from such a simulated data set is either arbitrary or circular. The frequency with which a pair of simulated covarying amino acids occur together within sequences or the Jaccard index (which is simply derived from the initial frequency) will lead to an arbitrary assignment of true-positive or true-negative covariation for that pair. Any derivation of the ROC using this arbitrary assignment method would be meaningless, especially when the error model was incorporated. The inherent error model within this covariation analysis method (26) compares the Jaccard index derived from a real data set to that derived from a simulated random data set and estimates a probability that a pair of observed amino acids covary to a greater extent than would be expected based on a random model. Using the estimated probability of pairs derived from a simulated data set to assign true-positive or true-negative status to a pair of covarying amino acids would be circular, in that both assignment and ROC curve analyses would be performed using the same score and, consequently, the method would appear to perform perfectly. For these reasons, ROC curve validation was omitted from this analysis and is not generally performed during analyses of covariation. Simulations of the effect of varying the distance score that was incorporated into the probability estimated from the Jaccard index were included, and these suggested that the evolution of the HIV integrase coding region showed signs of convergence due to the repeated observation of pairs of covarying amino acids within separate branches of each subtype phylogeny.
An ad hoc network of covarying resistance-associated mutations was elucidated along with 15 novel mutations that covaried with these sites. The results are presented collectively for three subtypes (A, B, and C) (Fig. (Fig.5);5); however, many of these interactions were subtype specific, and it remains to be seen whether there will be differential patterns of compensatory mutations when the inhibitors are used in patients infected with non-subtype B HIV-1. Both the resistance-associated mutations and the 15 highlighted polymorphisms occurred within the catalytic core domain of integrase. However, the proposed tetrameric structure of HIV-1 integrase also provides a mechanism whereby mutations in the N- and C-terminal domains can covary both with each other and directly with the catalytic domain itself (Fig. (Fig.6)6) (33). Interactions of this type were detected using covariation analysis but are difficult to rationalize until the macromolecular structure of integrase is confirmed. The use of covariation analysis as a tool to predict positions within HIV-1 integrase that may be associated with resistance to such inhibitors will be validated by monitoring the evolution of the enzyme under drug selection pressure in a clinical setting.
This work was funded by the Health Protection Agency and the NIHR UCLH/UCL Comprehensive Biomedical Research Centre.
Published ahead of print on 2 July 2008.