We have constructed an in-depth map of the HIV-1 genome that presents the landscape of genetic variation in the context of several levels of structural and immunological constraints. Over two-thirds of the viral genome and proteome are conserved. Conservation is strongly determined by RNA structure and, at the protein level, by the need to maintain α-helix domains. On the other hand, 12% of the genome is under positive selection, with an enrichment of such sites in CD4 T cell and antibody epitopes. Previous studies advanced the understanding of protein [7] or RNA [10] structural constraints on viral genome diversity, or of immune selective pressures [11]. Here, we examined the viral genome under the paradigm of multiple layers of information.
Genomes of single-stranded RNA viruses contain important structures that support internal ribosome entry sites, packaging signals, pseudoknots, transfer RNA mimics, ribosomal frameshift motifs, and cis-regulatory elements. Watts et al. used high-throughput SHAPE to interrogate nucleotide flexibility in the HIV-1 genome, together with estimates of pairing probability at each nucleotide. This approach led to the identification of 10 regions that exhibit both low SHAPE reactivity and high pairing probability. Most genome regions with low SHAPE reactivity were shown to be associated with a regulatory function. Watts et al. proposed a model in which, in addition to the linear relationship between RNA and protein primary sequences, there is a second level of higher-order RNA structure that directly modulates ribosome elongation, thus influencing native protein folding and tertiary structure. Although the present study uses SHAPE reactivity data derived from a single viral strain, Watts et al. compared the empirical data with evolutionary base-pairing probabilities predicted from an alignment of non-recombinant group M subtype sequences from the Los Alamos database, and found that only four regions were in disagreement. Overall, the present study underscores that this novel component of the genetic code represents the strongest determinant of conservation.
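As an illustration of how regions combining low SHAPE reactivity with high pairing probability might be located, the following minimal Python sketch scans two per-nucleotide tracks for contiguous runs that satisfy both criteria. The function name, the cutoffs (`max_shape`, `min_pairing`) and the minimum run length are assumptions made for this example; they are not the thresholds used by Watts et al. or in the present study.

```python
from typing import List, Tuple


def structured_regions(
    shape: List[float],
    pairing: List[float],
    max_shape: float = 0.3,    # assumed cutoff for "low" SHAPE reactivity
    min_pairing: float = 0.8,  # assumed cutoff for "high" pairing probability
    min_length: int = 5,       # assumed minimum run length, in nucleotides
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges in which every nucleotide
    has SHAPE reactivity <= max_shape and pairing probability >= min_pairing."""
    if len(shape) != len(pairing):
        raise ValueError("shape and pairing must have the same length")

    regions: List[Tuple[int, int]] = []
    start = None  # index where the current qualifying run began, if any
    for i, (s, p) in enumerate(zip(shape, pairing)):
        qualifies = s <= max_shape and p >= min_pairing
        if qualifies and start is None:
            start = i
        elif not qualifies and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(shape) - start >= min_length:
        regions.append((start, len(shape)))
    return regions
```

Both tracks are expected to be indexed by the same genome coordinates; runs shorter than `min_length` are discarded so that isolated nucleotides are not reported as regions.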
Our study also indicates that protein structure, specifically α-helix domains, is associated with conservation. The α-helix is the most important and stable structural element in proteins [13]. In contrast, more variation can be accommodated by β-sheets. Importantly, both layers of constraint, RNA and protein structure, independently determine conservation and limit viral escape from selective pressures. These results confirm findings by Sanjuán et al., who described an association between SHAPE reactivity and second-codon position diversity (as a measure of protein sequence variation) and non-synonymous substitution rates (as a measure of selective pressure) [14].
The non-conserved viral genome encompasses two classes of sites: variable residues under relaxed constraint, and sites identified as being under positive selection, indicating higher fitness in a given host environment. We investigated three canonical selective forces of adaptive immunity: CD8 T cell, CD4 T cell and antibody responses. The results identify pressures that reflect population effects; i.e., a number of hosts share both the selective factor (the host factor) and the direction of the selecting force (escape). Here, an association was established between CD4 T cell and antibody responses and positive selection at cognate epitopes. Escape from antibody responses has clearly been demonstrated to occur from the very earliest phases of HIV-1 infection [15]. Escape from CD4 T cell responses is far less clear-cut; however, the relationship between CD4 T cell help and the maturation of the antibody response [17] may contribute to the association with positive selection at cognate epitopes. In contrast, CTL action, whose relevance is undisputed and which is widely associated with viral escape [16], did not leave a detectable signature in population-level polymorphism. Our interpretation is that the diversity of restricting alleles in the human population [20], the large proportion of sites in the viral genome identified as coding for CD8 T cell epitopes, and the diversity of fitness consequences of escape at the different CD8 T cell epitopes together fail to create a local signature of positive selection that can be identified in the viral genome at the population level. In addition, Irausquin and colleagues [12] indicated that many nonsynonymous mutations in both CD8 T cell and CD4 T cell epitopes are subject to conflicting evolutionary pressures, with positive selection favoring escape mutations within hosts expressing the respective presenting HLA molecule and purifying selection acting to remove them in the population at large. Another explanation could be that escape mutants without deleterious effects become quickly fixed in the population so that these epitopes are relatively conserved [11].
Overlapping reading frames are generally thought to be evolutionarily stable and conserved, as a mutation in one frame can negatively affect the second gene [21]. However, we identified low conservation and positive selection in those regions. An important caveat is that there are currently no methods available to reliably identify site-specific positive selective pressure in these regions [22]. Thus, the approach we used may overestimate positive selection in overlapping reading frames. We also considered a possible contribution of hypermutation to structural constraints and positive selection. Only three strains were identified as being hypermutated. The analysis also explored the distribution of APOBEC3G/F editing across the genome. Three patches showed enrichment in AG and AA dinucleotide motifs in a genome-wide positional screen. Although they were associated with sites under positive selection, it is not possible to establish a causal link between deamination and genome evolution.
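A genome-wide positional screen for dinucleotide motifs can be sketched as a sliding-window count. The Python example below reports, for each window, the fraction of overlapping dinucleotides that are AG or AA. The function name, window size and step are assumptions chosen for illustration, and the sketch does not reproduce the screening procedure used in the study.

```python
from typing import List, Tuple


def motif_fraction_by_window(
    seq: str,
    motifs: Tuple[str, ...] = ("AG", "AA"),  # dinucleotides named in the text
    window: int = 100,                       # assumed window size (nt)
    step: int = 50,                          # assumed step between windows (nt)
) -> List[Tuple[int, float]]:
    """Return (window_start, fraction) pairs, where fraction is the share of
    overlapping dinucleotides in the window that match one of `motifs`."""
    seq = seq.upper()
    results: List[Tuple[int, float]] = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        n_pairs = len(chunk) - 1  # number of overlapping dinucleotides
        if n_pairs < 1:
            continue
        hits = sum(1 for i in range(n_pairs) if chunk[i:i + 2] in motifs)
        results.append((start, hits / n_pairs))
    return results
```

Windows whose fraction clearly exceeds the genome-wide background would be candidates for enriched patches; how "clearly" is defined (for example, by a permutation test) is left open here.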
The joint analysis of different sets of information generated a comprehensive view of the complete genome. However, gene-specific analyses showed instances of departure from the genome-wide estimates. For example, CD8 T cell epitopes were generally well conserved, except in gp120, where they were enriched for sites under positive selective pressure. Similarly, CD4 T cell epitopes were enriched for sites under positive selection genome-wide, but these epitopes were significantly more conserved in gag. Thus, the various constraints and selective pressures do not act evenly across the genome.
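Statements of enrichment such as those above (for example, sites under positive selection within CD8 T cell epitopes in gp120) can be framed as a 2×2 contingency test: sites inside versus outside epitopes, and sites under positive selection versus not. This section does not state which statistical test was used, so the Python sketch below substitutes a one-sided Fisher's exact test as a generic illustration; the function name and argument layout are assumptions.

```python
from math import comb


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for enrichment.

    a: positively selected sites inside epitopes
    b: other sites inside epitopes
    c: positively selected sites outside epitopes
    d: other sites outside epitopes

    Returns the hypergeometric probability of observing `a` or more
    selected sites inside epitopes, given the table margins.
    """
    inside = a + b      # all sites inside epitopes
    selected = a + c    # all positively selected sites
    total = a + b + c + d
    denom = comb(total, selected)
    upper = min(inside, selected)
    tail = sum(
        comb(inside, k) * comb(total - inside, selected - k)
        for k in range(a, upper + 1)
    )
    return tail / denom
```

A small p-value indicates that positively selected sites occur inside epitopes more often than expected if selection and epitope membership were independent. Running the test per gene as well as genome-wide would mirror the comparison between gene-specific and genome-wide estimates.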
Overall, the present study extends previous analyses by using a larger curated dataset of near-complete subtype B genome sequences to jointly analyze different conservation and evolutionary forces. The study by Sanjuán et al. on the interplay between RNA structure and protein evolution [14] used a hundred sequences and excluded tat from the analysis. The study by Irausquin et al. on T cell epitopes [12] used between 46 and 599 sequences (depending on the gene) and did not include gp41; the genes with the strongest signals of positive selection in the present study. The study by Woo et al. analyzed the relationship between protein structure and evolutionary pressures on HIV-1 gag proteins, using solvent accessibility as a measure of protein structure and Shannon entropy as a measure of protein diversity. They found a clear relationship between the dN/dS ratio and the solvent accessibility of the residues in the protein, with surface amino acids being under positive selection and buried amino acids under purifying selection. They found no relationship between variability (as measured by Shannon entropy) and protein structure (helix or strand). In contrast, our results point to an association between conservation and α-helix domains, but only in the genome-wide analysis. The recent identification of multidimensional constraints on HIV-1 Gag evolution [23] also points towards analyses that could benefit from the layers of information included in the current study, with the goal of better identifying regions of immunological vulnerability.
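Shannon entropy, mentioned above as a measure of protein diversity, can be computed per alignment column from residue frequencies as H = -Σ p_i log2 p_i. The Python sketch below is a minimal version under stated assumptions: entropy is reported in bits, gap characters are ignored, and the function names are illustrative. The cited studies may have used different conventions.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str, gap_chars: str = "-.") -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0  # assumed convention for all-gap columns
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(aligned: List[str]) -> List[float]:
    """Per-column Shannon entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy("".join(col)) for col in zip(*aligned)]
```

A fully conserved column has entropy 0, and a column split evenly between two residues has entropy 1 bit, so higher values indicate greater variability at that position.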