Human immunodeficiency virus type 1 (HIV-1) mutations that confer escape from cytotoxic T-lymphocyte (CTL) recognition can sometimes result in lower viral fitness. These mutations can then revert upon transmission to a new host in the absence of CTL-mediated immune selection pressure restricted by the HLA alleles of the prior host. To identify these potentially critical recognition points on the virus, we assessed HLA-driven viral evolution using three phylogenetic correction methods across full HIV-1 subtype C proteomes from a cohort of 261 South Africans and identified amino acids conferring either susceptibility or resistance to CTLs. A total of 558 CTL-susceptible and -resistant HLA-amino acid associations were identified and organized into 310 immunological sets (groups of individual associations related to a single HLA/epitope combination). Mutations away from seven susceptible residues, including four in Gag, were associated with lower plasma viral-RNA loads (q < 0.2 [where q is the expected false-discovery rate]) in individuals with the corresponding HLA alleles. The ratio of susceptible to resistant residues among those without the corresponding HLA alleles varied in the order Vpr > Gag > Rev > Pol > Nef > Vif > Tat > Env > Vpu (Fisher's exact test; P ≤ 0.0009 for each comparison), suggesting the same ranking of fitness costs by genes associated with CTL escape. Significantly more HLA-B (χ²; P = 3.59 × 10⁻⁵) and HLA-C (χ²; P = 4.71 × 10⁻⁶) alleles were associated with amino acid changes than HLA-A, highlighting their importance in driving viral evolution. In conclusion, specific HIV-1 residues (enriched in Vpr, Gag, and Rev) and HLA alleles (particularly B and C) confer susceptibility to the CTL response and are likely to be important in the development of vaccines targeted to decrease the viral load.
Human immunodeficiency virus type 1 (HIV-1) infection can elicit strong human leukocyte antigen (HLA) class I-mediated immune responses from HIV-specific cytotoxic T lymphocytes (CTL) (55, 66), which are thought to be important mediators of disease progression. However, viral sequence changes in critical amino acid residues of HLA-presented epitopes and immediately surrounding regions can be selected for their ability to effectively reduce the potency of the CTL response during the course of infection (escape variants) (5, 6, 10, 30, 37, 54, 57). The selective pressure of this CTL escape can be balanced by viral fitness constraints (16, 20, 27, 42, 45, 46, 53), with several studies finding CTL escape variants that were less fit than the original strains (7, 11, 16, 27, 40, 45, 46, 62). Importantly, studies of humans and of macaques have shown that some escape variants revert to the original sequence when infecting a host with a different HLA genotype (16, 27, 40), again suggesting that CTL escape in these instances was associated with a loss in viral fitness. Additional evidence demonstrates that despite reduced fitness, CTL escape variants can be transmitted from one host to another (2, 28, 29, 40, 41, 47), suggesting that they can persist in a population with common HLA alleles. Indeed, viral loads have been found to be correlated with HLA supertype frequencies, suggesting a population-wide selection for CTL escape mediated by the more common HLA alleles (65).
The identification of viral variants predicted to have higher viral fitness has important implications for vaccine design. A vaccine that elicits immune responses to the fittest viral variant(s) and blocks common immune escape routes may not necessarily provide sterilizing immunity but might be able to produce a sufficient immune response to slow disease progression (50). Such a vaccine might also reduce secondary transmission of the virus, which has been shown to be correlated with plasma viral-RNA loads (18, 58, 61). This type of vaccine could have dramatic public health benefits.
Because CTL epitopes are HLA restricted, the sequence changes of common viral escape pathways are characteristic of the HLA genotype of the infected host. Recently, novel computational methods exploiting this fact have been developed to identify amino acid changes associated with HLA genotypes in large cohorts of HIV-1-infected individuals (8, 12). With these methods, bias due to viral lineage can be taken into account using phylogenetic correction, allowing a more accurate assessment of HLA associations than previous efforts (39, 48). These methods can predict whether specific amino acid residues are generally susceptible or resistant to the CTL response mediated by the corresponding HLA allele. However, a simple picture of escape and reversion could not explain the HLA-amino acid associations found by Bhattacharya et al. (8), as particular substitutions were found to change roles in terms of resistance or susceptibility depending on the specific patient's CTL tested. Furthermore, overlapping epitopes presented by different HLA molecules can place conflicting pressures on particular amino acid residues. Detailed analysis of the HIV-1 epitope SLYNTVATL has revealed substantial complexity of cross-recognition, escape, and reversion (32).
Because HLA alleles that are common in a given population can drive the fixation of the corresponding escape variants, consensus sequence peptides used in laboratory assays are likely to encode some viral escape variants incapable of eliciting a cellular response in vitro (3, 4). This could prevent the identification of certain epitopes when using consensus peptides, whereas novel computational methods may be better able to identify them.
Previous studies used phylogenetic correction to investigate HLA-amino acid associations in single genes or subgenomic fragments, primarily from subtype B sequences (8, 12). A high density of associations was found in Nef compared to protease, reverse transcriptase (RT), and Vpr, suggesting that HIV-1 proteins experience differential selection pressures in response to CTL recognition (12). Here, we provide a comparison of all HIV-1 proteins, allowing a more comprehensive assessment of viral evolution in CTL epitopes. Furthermore, we have focused our analysis on HIV-1 subtype C, the most globally prevalent and rapidly expanding subtype. We analyzed 261 full-length HIV-1 genomes from a cross-sectional South African cohort, investigating the relationship of HLA class I alleles and amino acid evolution to identify residues predicted to be either susceptible or resistant to HLA class I-mediated immune responses. The amino acid changes identified were further evaluated for their association with the plasma viral-RNA load. Three different phylogenetic correction methods were utilized because of their complementary strengths. We also focused in greater depth on proteins for which additional sequences (in Gag, Pol, Env, and Nef) and corresponding host HLA information were available. We organized the identified associations into immunological sets in an attempt to identify the most important residues and genes as candidates for a CTL-based HIV-1 vaccine.
A cross-sectional cohort of HIV-1-infected individuals from KwaZulu-Natal province of South Africa was ascertained with informed consent as previously described (35). Because of the cross-sectional nature of the study design, most individuals were expected to be in the chronic phase of infection. All individuals were sampled prior to antiretroviral therapy. This study was approved by the internal review boards of the University of Washington, Los Alamos National Laboratories, Massachusetts General Hospital, and the University of KwaZulu-Natal.
The sequences of full-length viral genomes, some of which (n = 244) were reported previously (60), were obtained by PCR amplification of plasma-derived viral RNA, followed by cloning and sequencing. This data set was enlarged using the same techniques, and the sequences have been reported in association with this paper. Hypermutated genomes, intersubtype recombinants, and sequences with no corresponding HLA information were not included in the analysis (n = 11). Among the remaining sequences (n = 261), all were subtype C except two (one subtype B and one subtype A). Additional gag, pol, env, and nef sequences were obtained using population sequencing as previously described (16, 35, 39, 40). HLA genotypes were obtained as previously described (35). The CTL responses of the study participants were experimentally determined using gamma interferon enzyme-linked immunospot (ELISPOT) assays with a set of overlapping peptides from the consensus of HIV-1 subtype C sequences, as previously described (35). From the ELISPOT data, HLA-peptide associations were defined as those from at least one individual who reacted at least once to the peptide.
Viral sequences were codon aligned using MacClade 4.08 (43). The 261 full-length HIV-1 genomes were divided into fragments of 1,000 nucleotides overlapping by 50 nucleotides (ignoring gene boundaries) to minimize the impact of intrasubtype recombination on phylogenetic reconstructions. Each partition was used to generate a maximum likelihood phylogenetic tree, as previously described (8). Columns within the alignment that were >10% gaps were considered null for the construction of the trees. The trees were then evaluated using three distinct methods, maximum likelihood character state analysis followed by a likelihood ratio test (MLL) (available at http://www.microsoft.com/mscorp/tc/computational-tools.mspx), maximum likelihood character state analysis followed by a Fisher test (MLF) (8), and parsimony character state analysis followed by a Fisher test (parsimony) to identify HIV-1 amino acid changes associated with the HLA genotype. Because each method evaluated HLA and amino acid associations at each amino acid site individually, the results were compiled based on the gene to which the amino acid belonged. The results for overlapping reading frames were thus partitioned to the gene corresponding to the reading frame used.
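The partitioning of the alignment into overlapping fragments can be sketched as follows. This is a minimal illustration only, assuming that "overlapping by 50 nucleotides" means that each 1,000-nucleotide window starts 950 columns after the previous one; the function name `partition_alignment` and its parameters are hypothetical and are not taken from the software used in the study.

```python
def partition_alignment(length, size=1000, overlap=50):
    """Return half-open (start, end) column ranges covering `length` columns.

    Assumption: consecutive windows share `overlap` columns, so the step
    between window starts is size - overlap. Gene boundaries are ignored.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    if length <= 0:
        return []
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, length)
        windows.append((start, end))
        if end == length:      # last (possibly shorter) window reached
            break
        start += step
    return windows
```

Each returned range would then be used to cut the codon alignment before building a separate tree for that fragment.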
In all three methods, only unambiguous matches were tabulated. For HLA alleles, two-digit genotypes were excluded as nonmatches when they shared the same first two digits of a four-digit HLA allele being considered. They were excluded because the true genotype might have been identical to the four-digit allele if the precision of the two-digit HLA genotype test had been to four-digit resolution. For amino acid comparisons, gaps were considered unknown and were excluded. Because ambiguous codons could result in multiple amino acid translations, they were also excluded. Ambiguous codons were present only in the extended data set, which included sequences derived from population sequencing (16, 35, 39), whereas the full-length genome sequences were derived from unambiguous cloned sequences (60). In the MLF method, however, amino acid ambiguities from gaps and IUPAC ambiguities occurred in only a small number of cases because missing or partial data were inferred from the tree. This method inferred the parental sequence on leaf nodes with gaps. Thus, change was not often observed in these parts of the tree.
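The handling of mixed two-digit and four-digit HLA typing described above can be illustrated with a small sketch. The allele string format (locus letter followed by two or four digits, e.g., `B35` or `B3501`) and the function name `hla_match` are assumptions made for illustration. The rule that a two-digit query matches any four-digit allele sharing its first two digits is also an assumption of this sketch.

```python
def hla_match(query, typed):
    """Compare a query allele with one typed allele of an individual.

    Returns True (match), False (nonmatch), or None (ambiguous; the
    individual is excluded from the tabulation for this query).
    """
    if query[0] != typed[0]:            # different locus (A, B, or C)
        return False
    q, t = query[1:], typed[1:]
    if q[:2] != t[:2]:                  # different two-digit group
        return False
    if len(q) == 4 and len(t) == 2:
        # Two-digit typing could hide the four-digit allele in question.
        return None
    if len(q) == 4 and len(t) == 4:
        return q == t
    return True                         # two-digit query, same group
```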
Two of the three methods (MLL and MLF) have been previously described (8, 12, 14). Both of these methods involved the evaluation of single-amino-acid changes and their relationship with the HLA genotype. For this study, overlapping 9-mers were also evaluated for an association with the HLA genotype using MLL and MLF.
The third method (parsimony) was developed to provide a more simplified approach and was compared to the others. For the parsimony method, the nucleotide sequence was translated into a protein sequence using MacClade 4.08 (43). The maximum likelihood trees of each 1,000-base-pair partition were used to make a parsimony-based reconstruction and inference of the amino acid sequences at the ancestral nodes. Using PAUP* 4.0b11 (64), the changes from the most recent nodal ancestor were recorded. Fisher's exact test was used to evaluate the relationship of these sequence changes and HLA genotypes using STATA 8.0 (StataCorp LP, College Station, TX). Similar to the other two methods, the resulting P values were used to calculate q values (the expected false-discovery rates), and only the associations with q values of <0.20 were considered.
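To illustrate how a list of P values is converted into false-discovery-rate estimates and filtered at 0.20, the sketch below uses the Benjamini-Hochberg adjustment. This is a simpler, more conservative stand-in chosen for illustration, not the q value procedure applied in the study; it corresponds to a q value calculation in which the proportion of true null hypotheses is assumed to be 1. The function name `bh_qvalues` is hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q is monotone in P.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def significant(pvalues, threshold=0.20):
    """Indices of tests whose adjusted value is below the threshold."""
    return [i for i, q in enumerate(bh_qvalues(pvalues)) if q < threshold]
```

In keeping with the text, such a calculation would be run separately for each protein and separately for two-digit and four-digit HLA genotypes.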
For all three methods, the P values from the two-digit and four-digit HLA genotypes were analyzed separately for the calculation of q values. In addition, the calculation of q values was performed separately on each protein. Contingency tables were used for MLF and parsimony, and the odds ratio determined the predicted impact of the amino acid residue evaluated. The following expressions for the odds ratios indicated the impact: [(A→!A and HLA)/(A→!A and !HLA)]/[(A→A and HLA)/(A→A and !HLA)] (odds ratio >1, escape [susceptible]; odds ratio <1, repulsion [resistant]) and [(!A→A and HLA)/(!A→A and !HLA)]/[(!A→!A and HLA)/(!A→!A and !HLA)] (odds ratio >1, attraction [resistant]; odds ratio <1, reversion [susceptible]). A and !A indicate the presence and absence, respectively, of a given amino acid residue. Similarly HLA and !HLA indicate the presence and absence, respectively, of the HLA allele. The arrows indicate “changes to.” The MLL method used four statistical models for escape, reversion, attraction, and repulsion (14).
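The first odds ratio above can be made concrete with a short sketch. Writing a = (A→!A and HLA), b = (A→!A and !HLA), c = (A→A and HLA), and d = (A→A and !HLA), the ratio is (a/b)/(c/d) = ad/bc. The code below is an illustration in pure Python rather than the software used in the study; the function names are hypothetical, and the two-sided P value follows the common convention of summing the probabilities of all tables that are no more probable than the observed one.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """(a/b)/(c/d) for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    total = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def interpret_change_away(or_value):
    """Label for the A -> !A table, following the definitions in the text."""
    if or_value > 1:
        return "escape (susceptible)"
    if or_value < 1:
        return "repulsion (resistant)"
    return "no association"
```

For the hypothetical counts a = 8, b = 2, c = 1, d = 5, the odds ratio is 20, which would be read as escape, and the two-sided P value is 400/11,440, or about 0.035.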
For all three methods, one can interpret these terms as follows: escape means that in the presence of the HLA, there was pressure to change away from A (A to !A); reversion means that in the absence of the HLA, there was pressure to change to A (!A to A); attraction means that in the presence of the HLA, there was pressure to change to A (!A to A); and repulsion means that in the absence of the HLA, there was pressure to change away from A (A to !A). For both the escape and reversion cases, the given amino acid was most likely to be contributing to the susceptible form of the epitope. Attraction and repulsion indicate amino acids likely to be associated with the immunologically resistant form of the epitope (Fig. 1).
Associations were placed into the following categories: raw associations (all unique associations) and individual associations (reducing “escape” or “reversion” to “susceptible” and “attraction” or “repulsion” to “resistant”; also removing associations involving a two-digit HLA allele when the same association involving a four-digit allele shared the same first two digits) plus site associations (site with any association) and immunological sets (described below).
The top 274 results from each of the three methods were compared using receiver operating characteristic curves (available at http://www.hiv.lanl.gov/content/immunology/hlatem/study5/index.html) that considered each individual association (after those that were due to linkage disequilibrium were removed) a true-positive result if it was embedded in or within 3 amino acids of a known epitope. Two methods at a time were subjected to a permutation test in which each result was randomly swapped between the two methods. The area under the receiver operating characteristic curve of the real results was then compared to the area under the curve for the permuted results.
Amino acid entropy was calculated using methods previously described (38). The Wilcoxon signed-rank test was used to evaluate the relationship of amino acid entropy and the presence of an HLA association per amino acid site.
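A per-site entropy of this kind is commonly computed as the Shannon entropy of the amino acid frequencies in one alignment column. The sketch below assumes that definition, uses the natural logarithm, and skips gaps and unknown residues; these choices, and the function name `column_entropy`, are assumptions for illustration and may differ from the cited method.

```python
from collections import Counter
from math import log


def column_entropy(column, ignore="-X?"):
    """Shannon entropy (natural log) of one alignment column.

    `column` is a string or list of one-letter amino acid codes; characters
    listed in `ignore` (gaps, unknowns) are not counted.
    """
    counts = Counter(aa for aa in column if aa not in ignore)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log(n / total) for n in counts.values())
```

A fully conserved column gives 0, and a column split evenly between two residues gives ln 2, or about 0.69.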
The significant associations from both the single-amino-acid comparisons and the 9-mer comparisons were combined with validation data sets, which included all known epitopes, the reactive peptides identified with ELISPOT reactions, and known epitope motifs (MotifScan [http://www.hiv.lanl.gov/content/immunology/motif_scan/motif_scan] and EpiPhred [http://www.codeplex.com/MSCompBio]), each with the corresponding HLA genotype. This information was grouped into maps of putative immunological sets, defined as the minimum number of epitopes required to explain the observed HLA-amino acid association patterns, taking 9-mer overlap, HLA linkage disequilibrium, and HLA ambiguity into account. Immunological sets were defined as a group of HLA-amino acid associations within the same HLA allele and in the same epitope region. For example, a set of one or more individual associations with the same or a related HLA that were very near one another in terms of sites in the sequence were considered to be members of the same immunological set. HLA alleles were considered related based on two-digit and four-digit resolution (i.e., B35 and B3501) or HLA linkage disequilibrium. The same definitions were used by Brumme et al. (12).
The identification of immunological sets was automated. The associations and validation data sets were grouped into candidate sets by sliding a 13-amino-acid window over the region. The HLA that had the most significant association with the amino acid patterns in the set was considered the founder HLA. If two HLA alleles were in linkage disequilibrium and an associated amino acid was contradictory (the susceptible amino acid for one was the resistant amino acid for the other), they were split into separate sets. Known or potential epitopes (based on the presence of anchor motifs) with appropriate HLA-presenting molecules that overlapped onto the grouped associations were mapped onto the immunological sets. Immunological sets might include only a single amino acid association or, alternatively, if there were complex escape alternatives for an epitope, multiple associations. They were constrained to be no longer than 16 amino acids (for example, a 10-amino-acid epitope and 3 amino acids on either side) to allow for processing mutations (34). To be conservative, we chose to evaluate three sites flanking each epitope because they were the fewest sites with the most likelihood of involvement in cleavage (34).
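The window-based grouping step can be sketched in a simplified form. The code below greedily groups the associated sites of each HLA allele so that every group fits inside a 13-amino-acid window. It deliberately omits the founder-HLA selection, linkage disequilibrium, contradictory-residue splitting, and epitope mapping described above, so it is an assumption-laden illustration and not the automated procedure itself; the function name `candidate_sets` and the input format are hypothetical.

```python
def candidate_sets(associations, window=13):
    """Group (hla, position) pairs into candidate sets per HLA allele.

    A new set is started whenever a site lies `window` or more positions
    beyond the first site of the current set.
    """
    by_hla = {}
    for hla, pos in associations:
        by_hla.setdefault(hla, set()).add(pos)
    sets = []
    for hla in sorted(by_hla):
        current = []
        for pos in sorted(by_hla[hla]):
            if current and pos - current[0] >= window:
                sets.append((hla, current))
                current = []
            current.append(pos)
        sets.append((hla, current))
    return sets
```

For example, sites 242 and 248 for one allele fall into the same candidate set, whereas a site at 260 for the same allele starts a new one.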
We scanned for single HIV amino acid associations with HLA as described above using the MLL and MLF methods. We also scanned with amino acid patterns in sets of contiguous 9-mers (epitope length fragments) that might be associated with the HLA. If multiple escape routes were available to a given HLA-presented epitope, then by scanning for HLA-associated patterns in contiguous 9-mers, we would have greater statistical power to identify the variety of forms. The 9-mer references used in 9-mer association tests, like single amino acids, represented all observed patterns in the input alignments and tree ancestry. We took the output at the immunological-map-processing stage and, when possible, reduced the 9-mer with HLA associations to potentially smaller n-mers representing just those amino acids that varied and the intervening conserved amino acids.
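The 9-mer enumeration and the reduction to smaller n-mers can be sketched as follows. The helper names are hypothetical, and the second variant in the usage example is an illustrative input, not a result of this study.

```python
def overlapping_kmers(seq, k=9):
    """All overlapping k-mers of `seq` as (start index, k-mer) pairs."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def trim_to_variable_span(variants):
    """Trim aligned, equal-length k-mers to the first..last variable column.

    The returned fragments keep the varying residues and the conserved
    residues that lie between them.
    """
    columns = list(zip(*variants))
    variable = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    if not variable:
        return list(variants)
    lo, hi = variable[0], variable[-1] + 1
    return [v[lo:hi] for v in variants]
```

Applied to `["SLYNTVATL", "SLFNTVAVL"]`, which differ at the third and eighth positions, `trim_to_variable_span` returns the 6-mers `["YNTVAT", "FNTVAV"]`.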
Following Brumme et al. (12), a hierarchical validation scheme was applied to the grouping of immunological sets. The HIV immunology database has two lists of experimentally defined epitopes: those that are optimally defined based on a set of stringent criteria and assembled in an annual review article (22), known as the “A list,” and the full database based on all of the epitopes that can be assembled from the experimental literature, known as the “B list” (http://www.hiv.lanl.gov/content/immunology/). The validation data sets were ranked into A list first and B list second. The third ranking, known as “motif,” included all known motifs from the HIV database (MotifScan [http://www.hiv.lanl.gov/content/immunology/motif_scan/motif_scan]) and epitopes predicted using Epitope Predictor (http://www.codeplex.com/MSCompBio). The “best” kind of available validation data was selected for each immunological set (A > B > motif), and all others were removed. The removal of these identical columns simply served to make the maps easier to read. Maps for all proteins and lists of associations can be viewed at http://www.hiv.lanl.gov/content/immunology/hlatem/study5/index.html.
The ratio of susceptible to resistant amino acids at each site association was calculated using individuals without the corresponding HLA allele, taking linkage disequilibrium into account. The HLA alleles were reduced to two-digit designations, with the exception of the A68 and B15 alleles, which were kept at four digits because they represented more than one HLA supertype.
To determine the number of validations expected due to chance, the individual associations were reduced to unique site associations in each immunological set. These sites were randomized 250 times. The number of sites found embedded in or within 3 amino acids of a known epitope was determined in both the original data set and the randomized data sets. The proportion of associations from the original data set was then compared to the randomized one for each gene.
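A randomization of this kind can be sketched as below. The text does not specify how sites were randomized, so the sketch assumes that the same number of sites is redrawn uniformly, without replacement, from the positions of the protein, and it reports the fraction of randomized data sets with at least as many validated sites as observed. The function names, the epitope input format (inclusive start and end positions), and the fixed seed are all hypothetical.

```python
import random


def near_epitope(site, epitopes, flank=3):
    """True if `site` lies in, or within `flank` residues of, any epitope."""
    return any(start - flank <= site <= end + flank for start, end in epitopes)


def validation_by_chance(sites, epitopes, protein_length, n_iter=250, seed=0):
    """Observed validated-site count and an empirical P value."""
    rng = random.Random(seed)
    observed = sum(near_epitope(s, epitopes) for s in sites)
    positions = range(1, protein_length + 1)
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.sample(positions, len(sites))
        if sum(near_epitope(s, epitopes) for s in shuffled) >= observed:
            hits += 1
    return observed, hits / n_iter
```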
At each amino acid site with an HLA association, the viral loads of individuals with each corresponding HLA and a change away from the susceptible residue (A→!A) were compared to the viral loads of individuals with no change away from the susceptible residue using the Wilcoxon signed-rank test. q values were determined from the resulting P values using QVALUE (63). The results with a q value of <0.2 were considered significant. The same analysis was repeated using the resistant instead of the susceptible residues.
Additional sequences in Gag, Pol, Env, and Nef were available for evaluation (16, 35, 39, 40). Extended data sets were generated containing these additional sequences, as well as the full-length genome sequences in the corresponding regions. Because the additional sequences had variable ends due to inconsistent sequencing runs and/or different primers, only amino acid sites with <10% missing data were included in the analysis, which served to trim the alignment under consideration to the core region of the added sequences. The results of the analysis of the extended data set were tabulated and compared to those obtained from the full-length genome analysis over the respective protein regions.
Accession numbers for full-length genomes, protein alignments, and host HLA information are available at http://www.hiv.lanl.gov/content/immunology/hlatem/study5/index.html.
Full-length HIV-1 genomes (n = 261) from a cross-sectional South African cohort were screened for amino acid changes associated with HLA alleles. Single amino acids, as well as 9-mers, were evaluated using phylogenetic correction methods that took the viral lineage into account. Amino acids were categorized into those predicted to confer susceptibility or resistance to the corresponding HLA-mediated immune response. As defined in Materials and Methods, susceptible residue categories included negative correlations, i.e., the amino acid was enriched when the corresponding HLA was absent from the host or the amino acid was infrequent when the HLA allele was present within the host. Resistant categories included positive correlations, i.e., the amino acid was infrequent when the HLA allele was absent from the host or the amino acid was enriched when the HLA allele was present within the host.
Using the three methods discussed below, we identified 948 HLA-amino acid associations categorized as reversion, escape, repulsion, and attraction (raw associations) (see Materials and Methods for definitions) with a q value of <0.2. Reducing the q value to 0.05 resulted in a loss of 61% of these associations, 58% of which were supported by ELISPOT results. To include as many true-positive results as possible, we chose to further evaluate all associations with a q value of <0.2. The odds ratios ranged from 0 to 310 (excluding infinite odds ratios; see http://www.hiv.lanl.gov/content/immunology/hlatem/study5/full/index.html for a complete list). After the amino acids were recategorized as either susceptible or resistant, there were 532 individual HLA-amino acid associations (individual associations). Using a sliding window 9 amino acids in length (9-mers), 26 additional individual associations were identified, each involving a single 9-mer with two or more variable sites, for a total of 558 individual associations. These were distributed at 257 positions along the genome (site associations). After the proximities of individual associations linked to a particular HLA, overlap of 9-mers, epitope boundaries, and HLA linkage disequilibrium were taken into account, these data were organized into 310 groups of HLA-amino acid associations, each predicted to be related to single epitope/HLA combinations (immunological sets). These individual associations and immunological sets were mapped in relation to the consensus of the sequences analyzed (Fig. 1 shows p17 and p24 of Gag; see http://www.hiv.lanl.gov/content/immunology/hlatem/study5/index.html for complete proteome maps). The maps showed that multiple residues at a single site or multiple sites within or near an epitope were often involved in immune escape and reversion.
The maps also illustrated sites with opposing selective pressures (Table 1), i.e., a single site with the same residue involved in two individual associations, one predicted to be susceptible to the CTL response mediated by one HLA allele and the other predicted to be resistant to the CTL response mediated by a different HLA allele.
The regions with the most immunological sets were at Pol amino acid position 607 with 33 sets and Nef amino acid position 83 with 28 sets, suggesting that these regions were under the most conflict due to immune pressure. The immunological set with the most individual associations (n = 7) was found within the Nef cluster and included a B*44-restricted epitope (KRQEILDLW), suggesting that this epitope is under the most conflicting selective pressure. The next largest immunological sets were in Nef, Tat, and Pol, each of which included six individual associations involving the C*0404-restricted epitope HSQRRQDIL (also in the large Nef cluster) and the B*42-restricted epitopes QPKTPCNKCY and YPGIKVRQL, respectively.
At each site association, plasma viral RNA loads in individuals with and without the susceptible residues were compared for all individuals with the corresponding HLA alleles. Under the assumption that susceptible residues were identified in individuals without an effective CTL response and resistant residues appeared in the presence of a CTL response (mostly ineffective due to the resistant residue), this analysis represents a comparison of the viral fitness of the susceptible to the resistant variant in the absence of an effective site-specific CTL response. Seven sites were associated with viral loads in individuals with changes away from the susceptible amino acid that were significantly lower than those in which there was no change (Table 2). Four of the seven residues were in Gag. Gag242 (Gag amino acid 242), associated with HLA-B*5801 and B*57, was previously found to be associated with changes in the viral load and viral fitness (11, 45). Vif33 was associated with HLA-B*1503 and HLA-C*02. These two HLA alleles were in linkage disequilibrium, and the site was embedded in a B*1503 epitope binding motif, so B*1503 was considered more likely to be driving the association. HLA-B*1503 was also previously found to be associated with low viral load in a cohort infected with subtype B (24) but not in our South African population infected with subtype C (24, 35, 36). Another site we detected (Pol725, associated with B*4403) had been recognized in an earlier study evaluating HLA-restricted immunological responses to HIV-1 peptides and associations with the viral load in our cohort (36). Gag184 had three HLA alleles (A*01, B*81, C*18) associated with the same single-amino-acid change. However, these three alleles were in linkage disequilibrium, and thus, the associations were considered likely to be driven by a single epitope/HLA combination.
Since an HLA-B*8101 epitope had been experimentally identified in this region (36), it was considered to be the most likely candidate. Two of the remaining amino acid sites were within predicted HLA binding motifs (Gag339 and Rev17), and only one site (Gag120) had no immunological support, suggesting that it might be a novel epitope or a false-positive result. A similar analysis was carried out using the resistant residues, but no associations with a q value of <0.20 were found. Intriguingly, five HLA alleles were identified by Kiepiela et al. (35) to be associated with a low viral load in this cohort, and we found four of the five (B*8101, B*57, B*5801, and B*4201) to have specific mutations associated with a low viral load (Table 2).
The highest proportion of site associations was found in Nef (Table 1), where 21% of the amino acid sites in the protein were found to be involved in one or more immunological sets. The lowest proportion of site associations was found in Env, where only 1% were found to be involved in immunological sets.
Because reversions to the more susceptible variant occur in the absence of HLA pressure, the numbers of susceptible and resistant residues at each site association were counted among the individuals without the corresponding HLA allele. To make the results independent of gene size, the ratio of susceptible to resistant residues was determined to reflect the amount of reversion relative to the amount of escape not balanced by reversion. The ratio of susceptible to resistant residues followed the order Vpr > Gag > Rev > Pol > Nef > Vif > Tat > Env > Vpu (Table 1), each with significantly more susceptible versus resistant residues than the subsequent protein (Fisher's exact test; P ≤ 0.0009 for each comparison). Although no bias is expected in this analysis, the presence of compensatory mutations may inflate the significance values. These results suggest that there is a greater fitness cost for escape in those proteins with a greater proportion of susceptible residues.
To investigate where the escape mutations occurred within epitopes, individual associations in known epitopes were evaluated. Among the HLA-assigned epitopes in the HIV database (http://www.hiv.lanl.gov/content/immunology/index.html), 73 either encompassed or were within 3 amino acids of an individual association. For 16/73 (22%) of these epitopes, an individual association was found outside the boundary of the epitope but within 3 amino acids of it, possibly reflecting aberrant epitope processing (34). Twenty-two of the 113 (19%) inferred anchor sites and 59/478 (12%) of the internal T-cell receptor-involved sites in known epitopes were found to be associated with an HLA allele in this study. A trend toward detectable selection disproportionately affecting anchor residues was also observed (Fisher's exact test; P = 0.067).
A similar study was performed on fragments of Nef (206 amino acids; n = 686), Vpr (96 amino acids; n = 425), and protease/RT (499 amino acids; n = 532) from a cohort of HIV-1 infected Canadians (97.5% infected with subtype B) (12). We found that the South African subtype C individual associations mapped either inside or within 3 amino acids of 18/69 (26%) Nef, 2/8 (25%) Vpr, and 9/41 (22%) protease/RT immunological sets identified by the Canadian study. However, the same epitope often had different individual associations for the corresponding HLA allele, implying different common escape patterns for homologous epitopes in different HIV-1 subtypes (Fig. 2). Among the shared immunological sets, 86% (25/29) of the epitopes had different escape patterns, depending on the subtype evaluated.
HLA-B and HLA-C alleles were more often identified in the individual associations than HLA-A (7%, 55%, and 38% HLA-A, -B, and -C, respectively). This was also true when the data were analyzed for each gene individually, with the exception of Vpr, which had 11 individual associations (64%, 27%, and 9% associations for HLA-A, -B, and -C, respectively) and Vpu, which had two associations, both with HLA-A. Overall, the proportion of individual associations with a q value of <0.2 was significantly greater for HLA-B than for HLA-A alleles (χ²; P = 3.59 × 10⁻⁵) and for HLA-C than HLA-A alleles (χ²; P = 4.71 × 10⁻⁶). HLA-B and HLA-C alleles were involved in similar proportions of individual associations with q values of <0.2 (P, not significant). In addition, the majority (60%) of the associations involved an HLA allele with higher frequency than the amino acid.
The sites in each protein had a wide range of amino acid entropy (Env, 0 to 2.64, median = 0.19; Gag, 0 to 1.96, median = 0.06; Nef, 0 to 1.96, median = 0.14; Pol, 0 to 2.01, median = 0.05; Rev, 0 to 1.16, median = 0.23; Tat, 0 to 1.79, median = 0.16; Vif, 0 to 1.54, median = 0.096; Vpr, 0 to 1.24, median = 0.057; Vpu, 0 to 1.91, median = 0.31). Overall, site associations had significantly greater mean entropy than sites without associations (0.59 versus 0.28, respectively; Wilcoxon signed-rank test; P < 0.0001). Most individual proteins also demonstrated a significant relationship between amino acid entropy and site associations (Wilcoxon signed-rank test; P < 0.0001 for Gag, Nef, Pol, Tat, and Vif and P = 0.0007 for Vpr), with the exception of Env and Rev (Wilcoxon signed-rank test; P, not significant for either protein).
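As an illustration of the per-site entropy measure, the following Python sketch computes the Shannon entropy of each column of a small alignment. The use of the natural logarithm, the function names, and the toy sequences are assumptions made here; the text does not state the logarithm base or how gaps and ambiguous residues were handled.

```python
from collections import Counter
from math import log

def site_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column."""
    counts = Counter(column)
    n = sum(counts.values())
    # A fully conserved column has a single residue with frequency 1, giving 0.
    return sum(-(k / n) * log(k / n) for k in counts.values())

def alignment_entropies(sequences):
    """Per-site entropies for aligned sequences of equal length."""
    return [site_entropy(col) for col in zip(*sequences)]

# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["MGAR", "MGAK", "MGSR", "MGAR"]
entropies = alignment_entropies(toy_alignment)
print([round(h, 3) for h in entropies])
```

Conserved sites yield an entropy of 0, and sites with more evenly distributed residues yield larger values, which is the sense in which sites with associations were more variable than sites without.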
Among the individual associations identified, 9% were supported by (embedded in or within 3 amino acids of) the optimally defined “A-list” epitopes (22), 13% by the database of reported T-cell epitopes (“B-list”), and 59% by HLA allele-specific binding motifs, and 20% did not have validation support. Because we used a q value cutoff of <0.2, we expected to see approximately 20% false positives, possibly more frequently in the group with no validation support. Significantly more associations were found to be validated by known epitopes compared to randomized data sets in Gag, Pol, Vif (P ≤ 0.004 for each), and Nef (P = 0.008). No significant differences were observed for the comparisons in the other proteins; they had fewer individual associations, and we thus infer that we did not have adequate power for these comparisons.
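The role of the q value cutoff can be sketched in Python as follows. This example uses the Benjamini-Hochberg adjustment as a simple stand-in for a false-discovery-rate calculation; it is not necessarily the q value procedure used in this study, and the P values and the function name `bh_adjust` are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P values for six HLA-amino acid tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
adjusted = bh_adjust(pvals)
# Keep associations below the cutoff of 0.2 used in the text.
kept = [p for p, q in zip(pvals, adjusted) if q < 0.2]
print(adjusted, len(kept))
```

With a cutoff of 0.2, roughly one in five of the retained associations is expected to be a false positive, which is the reasoning behind the expectation stated above.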
Three methods were evaluated to identify HIV-1 amino acid changes associated with the HLA genotype: maximum likelihood phylogenetic trees followed by likelihood ratio tests (MLL), maximum likelihood phylogenetic trees followed by Fisher tests (MLF), and a parsimony analysis followed by a Fisher test. Using a q value cutoff of 0.2 for all methods, the MLL method found the most raw associations, although the greatest number of individual associations was found by MLF (Table 3). This was primarily due to the MLL method's identification of more pairs of either negative (escape/reversion) or positive (attraction/repulsion) correlations, with each pair identifying the same HLA-amino acid combination. To account for correlated data, the numbers of individual associations were grouped based on sites and HLA alleles. In this case, the MLF method identified the most associations (MLF, 255; MLL, 211; parsimony, 114). On the other hand, in the Canadian data set (12), the MLL method found more individual associations grouped by site and HLA than the MLF method (MLL, 205; MLF, 177). Since the tests make different assumptions about the underlying process, the q values do not necessarily reflect the true percentages of spurious results; therefore, finding more associations at a given q value threshold does not necessarily mean that one method is superior to the others.
A similar proportion of individual associations supported by epitopes from the validation data set (A list, B list, and motifs) was found by each method (49% MLF, 49% MLL, and 54% parsimony; Fisher's exact test; P = 0.40). Using receiver operating characteristic curves and a permutation test, no significant difference was found between the capacities of the three methods to discriminate between sites supported or not supported by epitopes in the validation data set. Thus, the parsimony method, representing a new and more simplified approach, was equivalent to the other two methods.
To evaluate the impact of the number of sequences on the results of our analysis, regions where additional sequences were available from the same cohort (Gag [n = 191], Pol [n = 141], Env [n = 84], and Nef [n = 91]) were reanalyzed, along with the original set of 261 sequences, and compared to the results of the original analysis (excluding the individual associations found with the 9-mers). The number of individual associations in Env increased by 9-fold (9 versus 1), in Gag by 2.1-fold (111 versus 53), in Nef by 20% (102 versus 83), and in Pol by 11% (101 versus 90). Overall, there were 227 individual associations in the original data set in these regions, and 132 (58%) were maintained in the reanalysis with the extended data set. There were a total of 104 site associations identified in the original data set, and 77 (74%) of those were maintained in the reanalysis with the additional sequences. These changes were likely due to loss of false-positive results in the extended data set, as well as slight changes in P values and subsequent changes in q values bringing associations above or below the cutoff q value of <0.2.
The validation support (embedded in or within 3 amino acids of an A list epitope, B list epitope, epitope motif, or no support) for each site association was evaluated in both the original and extended data sets. To take linkage disequilibrium of the HLA alleles into account, the type of validation for each site was categorized only for the HLA with the highest level of support in each linkage disequilibrium group. The greatest number of new site associations identified in Gag and Pol were in the “no support” category, and the greatest number in Nef were in the “epitope motif” validation data (Fig. 3).
CTL responses to HIV-1 infection can lead to the evolution of escape mutations, effectively reducing the impact of the immune response and the control of disease progression. However, these escape mutations can be balanced by viral fitness constraints. A vaccine that targets both the most fit viral variants and their escape mutations might be effective enough to drive viral evolution toward states of lesser fitness and slow disease progression (50), potentially increasing the quality and duration of life for those infected while lowering the rate of transmission. Thus, our study and other recent work have sought to identify features of the viral proteome that might be especially important for targeting by immune responses elicited by vaccines.
Our approach to identifying critical immunologic features of the viral proteome was to identify associations between the HLA alleles that restrict the CTL response and amino acid variants found in the viruses infecting these individuals, along with their plasma viral loads, using the latter as a surrogate for disease status. Elements of this approach derived from earlier studies that linked HLA with specific amino acids (39, 48), refinement of these associations using phylogenetic correction (8, 14), and associations made between specific HLA alleles and specific viral proteins or the viral load (21, 24, 31, 36). Our study extends prior efforts at defining HLA-amino acid associations using these tools (12) to the analysis of whole viral proteomes and HIV-1 subtype C, the most common subtype worldwide.
Specific amino acid changes associated with HLA alleles were identified and placed into the context of known HLA epitopes and epitope motifs, taking HLA linkage disequilibrium into account. We identified amino acid residues that were predicted to be resistant (n = 244) or susceptible (n = 314) to the HLA class I-mediated immune response, with the prediction that susceptible variants were likely to reflect reversions based on selective pressure for increased viral fitness. We also evaluated the plasma viral loads among individuals with the corresponding HLA allele and the presence of the susceptible amino acid in the viral sequence. Some of the sites with changes away from the susceptible residue were found to be significantly associated with lower viral loads, including the well-studied epitope TSTLQEQIGW (TW10 [Gag240 to Gag249]), previously shown to impact viral fitness (11, 45). In total, we identified seven susceptible residues, escape from which was associated with lower viral loads, suggesting a fitness cost of immune escape. Several factors remain unknown for this analysis and likely resulted in a bias toward false-negative results. These factors include the sequence of the transmitted strain (or strains), the timing of the CTL response, the presence of compensatory mutations, and the presence of coinfections (such as malaria) that impacted the viral load. Furthermore, there may have been undetected associations due to lack of power in our study. Thus, these seven residues probably represent those with the most robust associations. These results support the idea that a vaccine that induces the CTL response to such epitopes, alone or in combination, may be effective in reducing viral loads.
Although the constituents of an effective vaccine immunogen remain elusive, the results of this study suggest greater importance of some rather than other viral proteins in eliciting suppressive, if not protective, immunological responses. In particular, our results suggest a ranked hierarchy of the proteins and the fitness costs associated with immune escape. Vpr, Gag, and Rev were at the top of the list, suggesting that these proteins are best able to elicit immune responses that decrease the fitness of the virus. On the other hand, the immune responses to Vpu, Env, and Tat (at the other end of the list) were largely ineffective. The high abundance of Gag compared to other HIV-1 proteins (25) makes it a logical vaccine candidate. Thus, this study, in combination with several previous reports, converges on Gag, or elements of Gag, as an important component of a vaccine immunogen. In a prior study, Gag was found to be the only protein targeted by the CTL response that was associated with lower viral loads (36), and Gag p24 has been reported to be the preferred target for HLA alleles associated with protection from disease progression (9). Other studies have shown that the proportion and magnitude of the CTL response to Gag were associated with the viral load (13, 17, 68) and that CTL escape mutations in Gag were associated with reduced viral fitness (2, 20, 40, 45, 52, 53). Gag also has a large number of conserved peptide elements (59) that are less likely to vary in response to immune pressure. Thus, it is probable that the structural and functional conservation of Gag is such that variation in most of its CTL epitopes is not well tolerated, making it a promising gene candidate for a CTL-based vaccine. The finding that Vpr and Rev were also among the proteins with the highest fitness costs associated with immune escape emphasizes the importance of considering these auxiliary proteins as vaccine components.
This study also supports the observation that host HLA alleles have differential impacts on the virus. We found that HLA-B alleles, followed by HLA-C, were the most commonly associated with amino acid changes and had a greater proportion of significant associations than HLA-A. Our work confirms several previous studies that have identified the importance of HLA-B in driving HIV-1 evolution and impacting disease (12, 35, 56) and also underscores the importance of HLA-C in impacting HIV-1 disease. Recently, a polymorphism upstream of the HLA-C gene was found to be associated with the viral set point in a genome-wide scan (19). Because HIV-1 downregulates HLA-A and HLA-B, but not HLA-C, expression (1, 15) and because HLA-C-restricted cells can have antiviral activity similar to that of HLA-A- and HLA-B-restricted cells (1), it may be especially worthwhile to consider HLA-C-restricted epitopes when selecting peptides for a CTL-based vaccine.
HLA population frequencies are also critical to consider in vaccine design. In a previous study, HLA-B*1503 was found to be associated with lower viral loads in a clade B-infected population, in which B*1503 was rare (24), but not in a clade C-infected population, in which B*1503 was common (24, 35, 36). Frahm et al. (24) concluded that fixation of escape mutations in subdominant epitopes was the cause of the lack of response in the subtype C cohort. We did identify an association of HLA B*1503 with low viral load in this subtype C cohort. Indeed, the resistant (escape) amino acid identified in our study was in the consensus C peptides used by Frahm et al., explaining the poor response and in accord with their hypothesis. This finding underscores the fact that any viral sequence used to detect CTL responses, including consensus sequences, may encode escape variants and thus preclude detection of cognate epitopes. More advanced approaches, such as toggle design (23) and others (26), are needed to reliably assess the total breadth of immune responses. We also found that identical epitopes, in the context of a subtype B or subtype C data set, had different susceptible and resistant variants, implying different escape mechanisms, depending on the subtype analyzed. Because the HLA frequency in each population may influence the fixation of viral variants and the viral subtype might also influence immune escape mechanisms, it is possible that different sets of epitopes will need to be considered for each subtype or each unique population.
This study employed three distinct methods for the identification of HLA-amino acid associations, each using phylogenetic correction. Each method contributed novel associations and, using known epitopes as a proxy for a gold standard (since not all epitopes have been identified), all three methods had similar predictive capacities, yet their predictions did not overlap completely. This was, in part, due to each method capturing certain types of associations better than others. For example, MLL was able to identify more details about the associations at each site, whereas MLF was able to identify more sites. Although the parsimony method identified fewer sites, there was greater validation support for those sites. Another explanation for the lack of overlap between the three methods is that each identifies a small proportion of the true associations, supporting the idea that the union of the three methods provides the most comprehensive set of associations.
To define which associations were true positives, we used a validation set of all known epitopes. However, this validation data set, consisting of the best available information, was not an ideal standard for comparison for several reasons. We expect that greater than 50% of CTL epitopes have not been experimentally identified (41). Uncertainty in epitope motifs and nonoptimized epitopes can also lead to incorrect support of an association. There also exists a bias for subtype B among known epitopes (49). Finally, compensatory mutations were not considered in the validation data set, and these can involve several amino acids or intermediate steps (16, 33, 42, 67). Thus, the associations with no validation support could have been novel epitopes, compensatory mutations, or false positives. For this reason, we did not exclude any of our significant associations based on lack of validation support.
Because we did not know the exact sequence of the transmitted strain from each of the infected individuals in our study, we relied on phylogenetic analyses to infer the evolutionary history of the viral populations. This inference may have introduced error into the analysis that was likely to bias the results toward the null hypothesis, producing false-negative associations. The length of the terminal branch (the average in our study was 0.047 mutations/site) reflects the amount of evolution that has taken place since the most recent ancestor and may also reflect the amount of time in the transmission chain from the inferred ancestor to the terminal sequence. This time was estimated to be an average of 10 years, assuming a molecular clock, a rate of evolution of 2.5 × 10−5 mutations/site/generation (44), and a generation time of 2 days (51). This estimate is likely inflated due to recombination. Undetected transmission events may have happened during this time, leading to a potential increase in the detection of false-positive results using MLF and parsimony. However, the branch lengths were taken into account in the MLL method, and this method was not found to be better or worse than MLF or parsimony, suggesting that the time elapsed from the most recent ancestor of each sequence did not adversely impact our ability to identify evolving sites within known epitopes.
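The arithmetic behind the estimate of approximately 10 years can be written out as a short Python sketch. It treats the mean terminal branch length of 0.047 as mutations per site and uses the rate and generation time quoted above; the variable names and the use of 365.25 days per year are assumptions made for illustration.

```python
# Values taken from the text.
branch_length = 0.047          # mean terminal branch length, mutations/site
rate_per_generation = 2.5e-5   # mutations/site/generation
generation_days = 2.0          # days per viral generation

# Under a strict molecular clock, elapsed generations = divergence / rate.
generations = branch_length / rate_per_generation
years = generations * generation_days / 365.25  # days-per-year is an assumption

print(f"{generations:.0f} generations, about {years:.1f} years")
```

Because the calculation is linear in the branch length, any inflation of branch lengths (for example, by recombination) inflates the time estimate proportionally.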
In summary, we have been able to identify HIV-1 amino acids that evolve in response to HLA-mediated cellular immunity among a South African population primarily infected with subtype C, including seven susceptible residues, escape from which was associated with lower viral loads. Moreover, we have provided a ranking of proteins based on the fitness cost of immune escape and, based on this ranking, recommend the inclusion of Gag, Vpr, and Rev as vaccine components. Our analysis also showed the importance of HLA-B and HLA-C in driving HIV-1 evolution. The information from this study can be used to design follow-up analyses to characterize CTL epitopes and viral fitness for consideration in a CTL-based vaccine. Although this study does not prove that an effective CTL-based vaccine is achievable, it provides encouragement for future research in this area.
We acknowledge Zabrina Brumme and Richard Harrigan for generously contributing data from the British Columbian HOMER cohort and Yi Liu and David Lockhart for helpful discussions and assistance. We also thank all of the study subjects for their participation.
This work was funded by NIH contract N01-AI-15422 (HLA typing and CTL epitope mapping to guide HIV vaccine development); U.S. Public Health Service awards AI11514, AI27005, AI058894, AI061734, and AI067077; and the University of Washington Center For AIDS Research (AI27757), including a New Investigator Award (AI047734) to C.M.R. B.T.K., T.B., and M.G.D. were funded through a Los Alamos National Laboratory directed-research grant and NIH AI061734. J.M.C., C.K., and D.E.H. were funded by Microsoft Research.
Published ahead of print on 23 April 2008.