Promiscuous guanine (G) to adenine (A) substitutions catalysed by apolipoprotein B RNA-editing catalytic component (APOBEC) enzymes are observed in a proportion of HIV-1 sequences in vivo and can introduce artifacts into some genetic analyses. The potential impact of undetected lethal editing on genotypic estimation of transmitted drug resistance was assessed.
Classifiers of lethal, APOBEC-mediated editing were developed by analysis of lentiviral pol gene sequence variation and evaluated using control sets of HIV-1 sequences. The potential impact of sequence editing on genotypic estimation of drug resistance was assessed in sets of sequences obtained from 77 studies of 25 or more therapy-naive individuals, using mixture modelling approaches to determine the maximum likelihood classification of sequences as lethally edited as opposed to viable.
Analysis of 6437 protease and reverse transcriptase sequences from therapy-naive individuals using a novel classifier of lethal, APOBEC3G-mediated sequence editing, the ‘apolipoprotein B mRNA-editing catalytic polypeptide-like 3G (APOBEC3G)-mediated defectives (A3GD) index’, detected lethal editing in association with spurious ‘transmitted drug resistance’ in nearly 3% of proviral sequences obtained from whole blood and 0.2% of samples obtained from plasma.
Screening for lethally edited sequences in datasets containing a proportion of proviral DNA, such as those likely to be obtained for epidemiological surveillance of transmitted drug resistance in the developing world, can eliminate rare but potentially significant errors in genotypic estimation of transmitted drug resistance.
Since 1991, it has been observed that HIV-1 sequences occasionally contain an excess of guanine (G) to adenine (A) substitutions [1-4]. More recently, it has become clear that this ‘G-to-A hypermutation’ reflects the activity of host enzymes belonging to the apolipoprotein B RNA-editing catalytic component (APOBEC) family of cytidine deaminases, most notably apolipoprotein B mRNA-editing catalytic polypeptide-like 3G (APOBEC3G) and apolipoprotein B mRNA-editing catalytic polypeptide-like 3F (APOBEC3F). These enzymes can incorporate into budding HIV-1 particles and, on subsequent infection of target cells, catalyse the deamination of cytidine (C) to uridine (U) in nascent viral reverse transcripts, which can be manifested as G-to-A substitutions in plus-stranded proviral DNA (reviewed in [5,6]). In general, HIV-1 avoids APOBEC-mediated sequence editing through the activity of the virally encoded Vif protein. The occasional detection of hypermutated HIV-1 sequences in vivo [2-4] likely reflects natural variation among viruses in their capacity to suppress cytidine deamination, with some viruses carrying defective vif alleles. Although it has been suggested that some degree of sublethal editing by APOBEC enzymes may contribute to HIV-1 evolution [8-10], extensive G-to-A editing generally leads to mutational impairment of viruses. Genetic data support a long-standing role for primate APOBEC enzymes in innate immune defence against retroviruses, and mutational impairment of retroviral nucleic acids by sequence editing forms one component of this defence.
Sequence variation in lethally edited viruses may reflect biological processes that are qualitatively different from those shaping variation in viable viral genomes (i.e. sequence editing as opposed to purifying selection). It is, therefore, important that lethally edited sequences are identified in analyses that assume data to represent viable genetic material under selection, such as genotypic estimation of drug resistance. In this report, we develop a robust classifier, the ‘APOBEC-mediated defectives index’ (AD index), suitable for efficiently screening lethally edited sequences from large and diverse HIV-1 pol gene datasets. We show that APOBEC-mediated substitutions can introduce spurious ‘drug-resistance mutations’ into sequences that have been inactivated through lethal editing, and that spurious drug resistance in association with APOBEC-mediated sequence editing is observed in a proportion of sequences obtained from both blood and plasma. The implications of this finding for surveillance of transmitted drug resistance (TDR) are discussed.
To identify conserved APOBEC target sites in HIV-1 protease and reverse transcriptase genes, sequences of representative lentiviruses were obtained from Genbank and aligned. These included equine infectious anaemia virus (EIAV), AF327877; caprine arthritis encephalitis virus (CAEV), AF322109; ovine maedi-visna virus (OMVV), M34193; feline immunodeficiency virus (FIV), M36968; and bovine immunodeficiency virus (BIV), M32690. Alignments of protease and reverse transcriptase genes from representative primate lentiviruses were obtained from the Los Alamos Sequence Database (www.hiv.lanl.gov).
Three control datasets were used to assess methods for discriminating APOBEC-mediated sequence editing from naturally occurring and antiretroviral drug-selected HIV-1 sequence variation: a ‘wildtype control set’ of 20 nearly full-length reference sequences; a ‘hypermutated control set’ of 28 nearly full-length sequences that were annotated in Genbank as displaying G-to-A hypermutation (these previously collated datasets included representatives of each of the established group M subtypes); and a ‘drug-resistant control set’ of 697 group M sequences obtained from the Stanford HIV Drug Resistance Database. Sequences in this dataset were generated by population sequencing of plasma virus from patients who were reported as having had two or more different reverse transcriptase inhibitors and two or more protease inhibitors. All of these sequences had at least four major drug-resistance mutations and represented the following subtypes and circulating recombinant forms (CRFs): B (508), C (51), G (39), D (25), A (23), F (23), CRF01 (16), CRF02 (12).
To assess the potential effect of APOBEC-mediated sequence editing on epidemiological surveys of TDR, a set of 6437 reverse transcriptase and protease sequences obtained from distinct antiretroviral therapy (ART)-naive individuals was retrieved from Genbank. This dataset represented 77 studies of 25 or more individuals and included the following subtypes and CRFs: B (2499), C (1285), CRF02 (736), A (556), CRF01 (533), G (189), D (172), G (137), other recombinants/unclassifiable sequences (96). All sequences were at least 600 nucleotides in length and were annotated with respect to the tissue source, date, and country of sampling. A complete list of the publications from which sequences were obtained can be found online (http://cpr.stanford.edu/cpr/).
To quantify the extent of G-to-A substitution in protease and reverse transcriptase sequences, several previously described approaches were applied. The extent of G-to-A substitution can be estimated through comparison to a suitable reference sequence (e.g. a consensus sequence for the viral subtype under analysis). Two general methods that use this approach, ‘G-to-A preference’ (number of G-to-A substitutions/number of all substitutions) and ‘G-to-A burden’ (number of G-to-A substitutions/number of G nucleotides in the reference sequence), are implemented in the programme Hypermut. A similar measure defined by Pace et al. and a recently updated version of the Hypermut program (Hypermut 2.0) exploit the predilection of APOBEC enzymes to induce G-to-A mutations in specific dinucleotide contexts (GG-to-AG for APOBEC3G and GA-to-AA for APOBEC3F) to provide enzyme-specific measures of G-to-A substitution. The ‘consolidated 3G’ and ‘consolidated 3F’ scores are defined by Pace as the number of G-to-A substitutions at target dinucleotides per number of target dinucleotides in the reference sequence. In the Hypermut 2.0 program, Fisher’s exact test is used to define an excess of mutations (hypermutation) within specified dinucleotide contexts relative to a control context. A fifth measure, ‘product-substrate ratio’ (PS ratio), can be calculated without comparison to a reference sequence. PS ratio is defined as the number of product dinucleotides in the query sequence per number of substrate dinucleotides in the query sequence.
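To make the difference between the reference-based and reference-free measures concrete, the following Python sketch computes G-to-A preference, G-to-A burden, and a 3G-specific PS ratio on toy sequences. The function names, the toy sequences, and the decision to count overlapping dinucleotides are assumptions made for illustration, and the PS ratio is taken as product over substrate, as its name suggests. It is not the Hypermut implementation.

```python
def g_to_a_measures(reference, query):
    """Reference-based measures for two aligned sequences of equal length."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    substitutions = 0
    g_to_a = 0
    for r, q in zip(reference.upper(), query.upper()):
        # Ignore gaps and ambiguity codes; count only unambiguous differences.
        if r in "ACGT" and q in "ACGT" and r != q:
            substitutions += 1
            if r == "G" and q == "A":
                g_to_a += 1
    ref_g = reference.upper().count("G")
    return {
        "preference": g_to_a / substitutions if substitutions else 0.0,
        "burden": g_to_a / ref_g if ref_g else 0.0,
    }


def count_dinucleotide(seq, dinucleotide):
    """Count overlapping occurrences of a dinucleotide in a sequence."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide)


def ps_ratio(query, substrate="GG", product="AG"):
    """Reference-free ratio of product to substrate dinucleotides.

    Defaults describe the APOBEC3G context (GG-to-AG); pass
    substrate="GA", product="AA" for the APOBEC3F context.
    """
    n_substrate = count_dinucleotide(query, substrate)
    n_product = count_dinucleotide(query, product)
    return n_product / n_substrate if n_substrate else float("inf")


# Toy data (hypothetical): two G-to-A changes, both in a GG context.
reference = "ATGGGACTGGTA"
query = "ATAGGACTAGTA"
measures = g_to_a_measures(reference, query)
```

With these toy sequences both differences are G-to-A changes, so the preference is 1.0, while the burden is 2 of 5 reference G nucleotides. The PS ratio needs only the query sequence, which is why it does not depend on the choice of reference.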
Viral subtypes were determined for all sequences using the programme STAR. Drug resistance was assessed using a list of ‘surveillance drug-resistance mutations’ (SDRMs) endorsed by the WHO for global surveillance of transmitted HIV-1 drug resistance (http://www.who.int/hiv/drugresistance/en/) and designed to standardize estimates of TDR across geographic regions and viral subtypes. For epidemiological surveillance of drug resistance, the presence of a single SDRM in a sequence qualifies it as resistant.
Receiver-operator curves were plotted in R using the ROCR package. For mixture modelling, we assumed two underlying distributions: a Poisson distribution, reflecting the large number of sequences with low A3GD indices, and a uniform or normal distribution, reflecting the distribution of the remaining ART-naive set. The expectation-maximization algorithm was applied to identify the parameters of the distributions and to determine the maximum likelihood classification of sequences as lethally edited as opposed to viable/wildtype. A bootstrap analysis was run to assess the variability in the data resulting from the small number of sequences with high A3GD indices.
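The mixture-modelling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study: it works on per-sequence counts of index mutations rather than on the index itself, it uses a normal (not uniform) second component, and the starting values, the variance floor, and the toy data are all hypothetical.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_edited(k, p):
    """Posterior probability that a sequence with k index mutations is edited."""
    edited = p["weight_edited"] * normal_pdf(k, p["mu"], p["sigma"])
    viable = (1.0 - p["weight_edited"]) * poisson_pmf(k, p["lam"])
    return edited / (edited + viable)


def fit_mixture(counts, n_iter=100):
    """EM for a two-component mixture: Poisson (viable) plus normal (edited)."""
    # Crude starting values (assumed): a small edited fraction centred on the maximum.
    p = {"weight_edited": 0.05, "lam": 0.5, "mu": float(max(counts)), "sigma": 2.0}
    n = len(counts)
    for _ in range(n_iter):
        # E-step: responsibility of the 'edited' component for each sequence.
        resp = [posterior_edited(k, p) for k in counts]
        # M-step: weighted maximum likelihood updates of all parameters.
        n_edited = max(sum(resp), 1e-12)
        n_viable = max(n - n_edited, 1e-12)
        mu = sum(r * k for r, k in zip(resp, counts)) / n_edited
        var = sum(r * (k - mu) ** 2 for r, k in zip(resp, counts)) / n_edited
        p = {
            "weight_edited": n_edited / n,
            "lam": sum((1.0 - r) * k for r, k in zip(resp, counts)) / n_viable,
            "mu": mu,
            "sigma": max(math.sqrt(var), 0.5),  # floor keeps the component from collapsing
        }
    return p


# Toy data (hypothetical): most sequences carry 0-2 index mutations, a few carry many.
counts = [0] * 60 + [1] * 30 + [2] * 8 + [9, 10, 11, 12, 13]
params = fit_mixture(counts)
```

Once the parameters are fitted, `posterior_edited(k, params)` gives the probability that a sequence with `k` index mutations belongs to the edited component, which is the quantity a probability-based cut-off would be read from.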
Sequence context at the dinucleotide level strongly affects the efficiency with which cytidine deamination is carried out by various APOBEC enzymes, such that 3G tends to be associated with GG-to-AG substitutions, whereas 3F is associated with GA-to-AA substitutions [20-22]. The genomic region of HIV-1 typically sequenced for resistance testing contains nine conserved tryptophan (Trp) residues. The single codon for Trp (TGG) contains at least one 3G target site at which a G-to-A substitution leads to a stop codon (potentially, two 3G target sites, or one 3G target site and one 3F target site can be present, depending on sequence variation at the downstream codon position). As stop codons within open reading frames generally indicate mutational inactivation of proteins, stop codons at conserved tryptophan sites are excellent indicators of lethal sequence editing by APOBEC. To identify other mutations that might similarly serve as indicators of lethal, APOBEC-mediated sequence editing, we defined highly conserved 3G and 3F target sites in an alignment of protease and reverse transcriptase genes from representative exogenous lentiviruses. This analysis identified 34 3G and 27 3F target sites conserved throughout the HIV-1 M group and at which G-to-A substitutions give rise to rare mutations indicative of APOBEC-mediated editing (Table 1). Conserved APOBEC target sites identified in this analysis were used to define a classifier of lethal APOBEC-mediated editing, the AD index, which is defined as the number of mutations (excluding major drug-resistance mutations) arising through G-to-A substitutions at conserved APOBEC target sites divided by the number of conserved APOBEC target sites in the query sequence. Like other measures of APOBEC-mediated G-to-A substitution, this index can be configured as general, or as 3G or 3F specific, based on target site preferences.
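The index definition above can be sketched as a short Python function. The site positions and toy sequences here are hypothetical placeholders, not the sites listed in Table 1, and the sketch simplifies a ‘mutation’ to the presence of an A at the position of the editable G. Keeping excluded drug-resistance positions in the denominator is also an assumption, since the text does not specify how they are handled there.

```python
def ad_index(query, target_sites, excluded=frozenset()):
    """Fraction of conserved target sites in the query that carry a G-to-A change.

    target_sites: 0-based positions of the editable G at each conserved site.
    excluded:     positions whose G-to-A change is a major drug-resistance
                  mutation; these are not counted as index mutations.
    """
    query = query.upper()
    covered = 0
    mutated = 0
    for pos in target_sites:
        # A site contributes only if the query covers it with an unambiguous base.
        if pos >= len(query) or query[pos] not in "ACGT":
            continue
        covered += 1
        if query[pos] == "A" and pos not in excluded:
            mutated += 1
    return mutated / covered if covered else 0.0


# Toy example (hypothetical positions): three target sites, two of them edited.
toy_sites = [1, 5, 9]
toy_query = "TAGAAGGATAGC"
toy_index = ad_index(toy_query, toy_sites)
```

Because the denominator counts only sites present in the query, the same function can be applied to partial sequences, and restricting `target_sites` to 3G or 3F sites would give the enzyme-specific variants of the index.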
To be considered indicator sites suitable for inclusion in calculation of the index, target motifs were minimally required to occur at sites where sequence editing would result in amino acid substitutions and to be conserved across more than 98% of HIV-1 group M isolates in the reference alignment. Of the 61 sites we identified, however, 34 (56%) were also conserved across all primate lentiviruses and 25 (40%) were conserved across all exogenous lentivirus lineages (Table 1). The rarity of G-to-A substitutions at these sites in HIV-1 was confirmed by their low prevalence (<0.1%) within the Stanford HIV Drug Resistance Database and by reference to other published studies of HIV-1 sequence variation [24,25]. The mutations used to calculate the AD index are not necessarily lethal to virus replication, but their rarity in a highly mutable pathogen such as HIV-1 indicates that they are efficiently removed by purifying selection in replicating virus populations.
The reliability of mutation-based indices for identifying lethally edited sequences was assessed using control datasets. The protease-reverse transcriptase region was extracted from control wildtype and hypermutated genome sets and assessed using general and enzyme-specific indices and four previously described measures of G-to-A substitution (see Methods). Five sequences extracted from hypermutated genomes showed no evidence of excessive G-to-A substitution by any of the measures used and were excluded from subsequent analysis. All 3G-specific measures performed better than general or 3F-specific measures (data not shown), which may reflect bias in the control set. However, as previous studies have also indicated a greater role for 3G in hypermutation of HIV-1 [4,10,15], we restricted our subsequent analysis to this enzyme.
Figure 1a shows the generalized G-to-A measures (G-to-A preference and G-to-A burden), the 3G-specific measures (consolidated 3G score and 3G P/S ratio), and the ‘A3G-mediated defectives index’ (A3GD index) applied to the wildtype and hypermutated control sequence sets. Figure 1b shows the same five measures applied to drug-resistant and hypermutated control sequence sets. The mean and median differences in scores for each of the two groups were highly statistically significant for all five methods (P < 0.01); however, overlap between scores for the drug-resistant and hypermutated sets was observed for some measures (Fig. 1). Plotting of receiver-operator curves (data not shown) revealed that the A3GD index performed as well as, and in some cases more reliably than, other measures as a classifier of lethal A3G-mediated editing. For measures utilizing comparison to a reference sequence, it was observed that natural and drug-selected variation in the query sequence could affect performance, particularly if the reference sequence was not closely related to the sequence under analysis.
A3GD indices were calculated for a set of 6203 reverse transcriptase and protease sequences obtained from ART-naive individuals. The heterogeneity in A3GD indices was explored using a mixture model that assumed two underlying distributions: a Poisson distribution, reflecting the large number of sequences with low A3GD indices (i.e. unedited sequences) and a uniform or normal distribution, reflecting the distribution of the remaining, potentially 3G-edited set (Fig. 2). On the basis of this analysis, an A3GD index of more than 0.08 (equivalent to three or more A3GD index mutations in a full-length protease-reverse transcriptase sequence) indicates a more than 99% probability of mutational inactivation through APOBEC-mediated editing.
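The stated equivalence between the 0.08 cut-off and ‘three or more’ index mutations can be checked with a few lines of Python. The sketch assumes that a full-length protease-reverse transcriptase sequence covers all 34 conserved 3G target sites reported above. The constant and function names are hypothetical.

```python
N_3G_SITES = 34   # conserved 3G target sites reported in the text
CUTOFF = 0.08     # A3GD index cut-off from the mixture model


def is_lethally_edited(n_mutations, n_sites=N_3G_SITES, cutoff=CUTOFF):
    """Classify a sequence from its count of A3GD index mutations."""
    return n_mutations / n_sites > cutoff


# Smallest number of index mutations that exceeds the cut-off.
min_mutations = next(
    k for k in range(N_3G_SITES + 1) if is_lethally_edited(k)
)
```

Two mutations give an index of 2/34 (about 0.059), below the cut-off, and three give 3/34 (about 0.088), above it. For shorter sequences that cover fewer target sites, the same cut-off corresponds to fewer mutations, so the count-based rule of thumb applies only to full-length sequences.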
Applying this cut-off to the ART-naive sequence dataset revealed that nine of 77 studies contained at least one hypermutated sequence (Table 2 [4,5,27-34]). Although 12 sequences from four studies were annotated as hypermutated, 16 sequences from five studies were not. Of these 16, 13 were classified as hypermutated by the Hypermut 2.0 program. A total of eight sequences classified as lethally edited by the A3GD index did not contain any stop codons. Of the lethally edited sequences from drug-naive patients, 75% had major resistance mutations (including SDRMs) arising through G-to-A changes at conserved APOBEC target sites and no other drug-resistance mutations, suggesting that the apparent TDR in these patients was an artifact introduced by A3G-mediated sequence editing. In these studies, the estimated prevalence of drug resistance within the study population (as defined by the presence of one or more SDRMs) was lower by a median of 2.3% (range: 0.4-10.3%) if the lethally edited sequences were excluded. From the current International AIDS Society (IAS)-USA major and/or SDRM lists, G73S and D30N in protease and D67N and M184I in reverse transcriptase, all of which can arise through G-to-A substitutions at highly (100%) conserved APOBEC dinucleotide target sites, were the mutations most commonly associated with lethal editing. Other drug-resistance mutations could potentially arise through APOBEC-mediated G-to-A substitutions, but in these cases, the introduction of that mutation through a single G-to-A substitution was dependent on rare polymorphisms at synonymous codon positions or within neighbouring codons, or on combined 3G-mediated and 3F-mediated substitutions (Table 1).
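How a single G-to-A substitution can produce either a stop codon or an apparent resistance mutation can be illustrated by enumerating the outcomes for individual codons. The sketch below uses the standard genetic code. The example codons are illustrative assumptions (Trp and Met each have only one codon, but the Asp and Gly codons shown are not taken from a specific HIV-1 reference sequence), and the dinucleotide context that determines 3G versus 3F targeting is ignored.

```python
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # TNN codons
    "LLLLPPPPHHQQRRRR"   # CNN codons
    "IIIMTTTTNNKKSSRR"   # ANN codons
    "VVVVAAAADDEEGGGG"   # GNN codons
)
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_g_to_a(codon):
    """Map each single G-to-A mutant of a codon to the amino acid it encodes."""
    codon = codon.upper()
    outcomes = {}
    for i, base in enumerate(codon):
        if base == "G":
            mutant = codon[:i] + "A" + codon[i + 1:]
            outcomes[mutant] = CODON_TABLE[mutant]
    return outcomes


trp = single_g_to_a("TGG")   # tryptophan
asp = single_g_to_a("GAT")   # aspartate, as in D30N or D67N
met = single_g_to_a("ATG")   # methionine, as in M184I
gly = single_g_to_a("GGT")   # glycine, as in G73S
```

Either G-to-A change in TGG yields a stop codon, whereas the same type of change turns an Asp codon into Asn and the Met codon into Ile. For glycine, a serine results only when the third codon position is T or C, which is one way in which synonymous variation can determine whether a single substitution produces a given resistance mutation.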
In the ART-naive dataset, lethally edited sequences were significantly more likely to be obtained from proviral DNA of peripheral blood mononuclear cells (19/640, 3%) than sequences obtained from plasma (9/5563, 0.2%; P < 0.001), as might be expected. Viral genetic material that had apparently been extensively edited by APOBEC3G, however, was also identified in a small proportion of plasma samples (Table 2), and although this observation was unexpected, it was supported by the identification of similarly edited sequences (~0.1%) in a separate dataset comprising 9996 sequences obtained from plasma and submitted to the Stanford Medical Centre for resistance testing. In the Stanford data set, the presence of potentially lethally edited sequences in plasma was always associated with low plasma viral load. It is not clear whether plasma sequences with high A3GD index scores, but no stop codons, represent viruses capable of initiating a new round of infection, and in these cases, editing should be considered potentially rather than necessarily lethal. In theory, genomic RNA expressed from otherwise defective proviruses may be packaged into viral particles provided a second, viable provirus is present to provide the necessary viral proteins in trans.
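The comparison of proportions reported above can be reproduced approximately from the counts in the text. The statistical test behind the reported P value is not named, so the one-sided Fisher's exact test used in this Python sketch is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_upper_tail(k, n_draws, n_success, n_total):
    """One-sided Fisher's exact P value: P(X >= k) under the hypergeometric null.

    X is the number of lethally edited sequences among n_draws sequences when
    n_success of n_total sequences in the whole dataset are lethally edited.
    """
    denominator = comb(n_total, n_draws)
    numerator = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draws - i)
        for i in range(k, min(n_success, n_draws) + 1)
    )
    return numerator / denominator


# Counts taken from the text.
edited_pbmc, total_pbmc = 19, 640
edited_plasma, total_plasma = 9, 5563

pct_pbmc = 100 * edited_pbmc / total_pbmc
pct_plasma = 100 * edited_plasma / total_plasma
p_value = fisher_upper_tail(
    edited_pbmc,
    total_pbmc,
    edited_pbmc + edited_plasma,
    total_pbmc + total_plasma,
)
```

Under the null hypothesis that tissue source is unrelated to editing, about three of the 28 lethally edited sequences would be expected among the 640 proviral sequences, so observing 19 gives a P value far below the 0.001 threshold quoted in the text.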
Several inexpensive, generic, and fixed-dose triple antiretroviral drug combinations have shown remarkable success in pilot HIV-1 treatment programmes in low-income and middle-income countries [36-39]. Scaling-up access to ART in low-income and middle-income countries should, therefore, reduce morbidity and mortality in millions of HIV-1-infected persons. Standardized, simplified treatment protocols have been developed to guide clinical decision-making accompanying antiretroviral drug use in areas where resource limitations preclude the intensive clinical management typical of high-income countries. The threat from TDR in this environment is uncertain [41-43]. Accordingly, the World Health Organisation has established a global programme for genotypic surveillance of TDR in representative untreated populations in regions where access to ART is expanding, so that any rise in the prevalence may be detected at an early stage [18,44]. As a relatively low threshold has been set for public health action in response to TDR (>5% prevalence), it is important that surveillance protocols are implemented scrupulously. Reporting a falsely low prevalence of drug resistance could undermine the success of standard therapies, whereas reporting a falsely high prevalence could result in the abandonment of current effective and relatively inexpensive treatments.
In this report, we found that spurious drug-resistance mutations are occasionally observed in association with patterns of substitution indicative of lethal, APOBEC-mediated sequence editing. Substitutions that can be introduced by sequence editing and by drug selection include mutations associated with each of the three main antiretroviral drug classes: protease inhibitors (G73S, D30N), nucleoside reverse transcriptase inhibitors (M184I), and nonnucleoside reverse transcriptase inhibitors (G190S/E). Such artifactual TDR within lethally edited sequences, however, is only likely to occur in datasets containing a significant proportion of sequences derived from proviral DNA, such as samples stored using dried blood spot (DBS) filter paper.
As DBS filter paper is a highly attractive medium for transporting viral material in resource-limited settings, and SDRMs should not be removed from the list until it is shown that this will not significantly diminish the sensitivity of surveillance protocols, we propose that screening for lethal editing should be carried out routinely in surveillance programmes incorporating DBS-derived samples. The resource implications of screening are minimal, as both A3GD index-based and PS ratio-based screening protocols have been integrated into the Calibrated Population Resistance (CPR) tool (an online and publicly accessible programme for estimation of transmitted resistance) on the Stanford HIV drug resistance website (http://cpr.stanford.edu/cpr/). An advantage of using A3GD index-based and PS ratio-based approaches (as opposed to other measures of G-to-A substitution) is that they can be calculated without a reference sequence, thereby obviating the need to accurately characterize natural diversity in the query sequence set, which is likely to be challenging in areas where subtype diversity is high (i.e. mainly West and Central Africa, but increasingly other areas as well [47-49]). CPR also provides PS ratio values for sequences as a second measure of G-to-A substitution. The A3GD index and PS ratio are also appropriate classifiers for rapidly screening lethally edited sequences from large protease and reverse transcriptase sequence datasets to be used in other analyses, such as molecular epidemiological investigations utilizing phylogeny [47,48].
The observation that a small proportion of sequences obtained from plasma contained patterns of substitution indicative of lethal editing, including stop codons in some sequences, may have implications for the design of studies investigating minority strains in HIV-1 infection, and also suggests that in rare cases, screening for lethal sequence editing may be relevant to procedures based on plasma samples, such as interpretation of drug resistance in sequences obtained during routine genotypic testing.