We first evaluated whether global RNA genome structure is linked to protein structure. HIV-1 produces three major classes of mRNA. The 9 kb class encodes Gag and Gag-Pol and is identical to the packaged genomic RNA analyzed here except that, as an mRNA, it is not dimerized at its 5′ end². There are very few differences in the SHAPE reactivity of dimeric and monomeric RNAs at the 5′ end of the genome⁶. Thus, genome structures outside of the dimerization region will correspond closely to those of the mRNA that encodes Gag and Gag-Pol. The most abundant 4 kb env mRNA is generated by splicing nucleotide 288 (SD1, the major splice donor) to nucleotide 5522 (termed the SA5 site)¹⁵. SA5 is followed by an unstructured genome region. Thus, RNA structures identified in the env coding region are likely to exist in the spliced mRNA that encodes Env. Structures for the 1.8 kb class of mRNAs, which generate Tat and Rev, cannot be predicted using the genomic RNA because discontinuous segments are joined in the final mRNA.
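The reasoning about which genomic structures can carry over into a spliced mRNA can be illustrated with a small sketch. It keeps only base pairs whose two nucleotides lie in the same retained segment, which is a deliberately conservative assumption: pairs that would span a splice junction are dropped because their sequence context changes. The function name, the 1-based inclusive coordinates, the example base pairs, and the 3′ boundary of 9000 are all hypothetical placeholders; only positions 288 and 5522 come from the text.

```python
def retained_pairs(pairs, segments):
    """Return base pairs (i, j) whose two nucleotides fall in the same retained segment.

    pairs    -- iterable of (i, j) genomic positions, 1-based (assumed convention)
    segments -- list of (start, end) retained segments, inclusive (assumed convention)
    """
    def segment_of(pos):
        # Index of the retained segment containing pos, or None if spliced out.
        for idx, (start, end) in enumerate(segments):
            if start <= pos <= end:
                return idx
        return None

    kept = []
    for i, j in pairs:
        seg_i, seg_j = segment_of(i), segment_of(j)
        if seg_i is not None and seg_i == seg_j:
            kept.append((i, j))
    return kept


# Hypothetical env-like mRNA: 5' segment ending at SD1 (288), 3' segment starting at SA5 (5522).
# The 3' boundary (9000) and the base pairs are placeholders, not data from the study.
env_segments = [(1, 288), (5522, 9000)]
example_pairs = [(100, 200), (250, 5600), (6000, 6100), (300, 400)]
print(retained_pairs(example_pairs, env_segments))
```

With many short, discontinuous segments, as in the 1.8 kb class, few genomic pairs would pass this filter, which is consistent with the statement that those structures cannot be predicted from the genomic RNA.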
The Gag, Gag-Pol and Env polyprotein precursors are synthesized roughly as beads on a string, and the constituent proteins are liberated by proteolytic cleavage²,³. Eight inter-protein peptides link the HIV proteins (green bars). The RNA sequences that encode these spacer peptide linkers in Gag (at the MA-CA, CA-NC, and NC-p6 junctions), Pol (PR-RT and RT-RNase H junctions) and Env (SP-gp120 and gp120-gp41 junctions) all (except the RNase H-IN junction) have SHAPE reactivities that are much lower than the median. RNA sequences that encode these inter-protein peptide linkers are more highly structured than 95.2% of randomly selected regions in the genome (Fig. S4a).
Domains within the individual HIV-1 proteins CA, RT, and IN are also linked by unstructured peptide elements, and each domain junction is encoded by an RNA region of low SHAPE reactivity (compare yellow bars with dark blue trace). Protein loops encoded by RNA regions with low SHAPE reactivity include the cyclophilin loop and the linker between the N- and C-terminal domains in CA, both loops that link independently folded domains in RT, and the 8- and 9-amino-acid loops linking the three domains in IN (in yellow). These protein domain junctions are more highly structured than 88.9% of randomly selected equivalent-length regions in the genome (Fig. S4b).
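A comparison of this kind, a region of interest against randomly selected equivalent-length regions, can be sketched as follows. This is a minimal illustration and not the procedure behind Fig. S4, which is not described in this passage: it assumes that low median SHAPE reactivity is used as the measure of structure, that random windows are drawn uniformly along the genome, and that coordinates are 0-based half-open. The function name and the synthetic profile are hypothetical.

```python
import random
from statistics import median


def percent_less_structured(reactivity, start, end, n_random=1000, seed=0):
    """Percentage of random equal-length windows with a higher median reactivity
    (i.e. less structure) than the window reactivity[start:end]."""
    rng = random.Random(seed)
    length = end - start
    observed = median(reactivity[start:end])
    last_start = len(reactivity) - length  # last valid 0-based window start
    higher = 0
    for _ in range(n_random):
        s = rng.randint(0, last_start)  # randint includes both endpoints
        if median(reactivity[s:s + length]) > observed:
            higher += 1
    return 100.0 * higher / n_random


# Synthetic profile (not real SHAPE data): uniformly reactive except one unreactive stretch.
profile = [1.0] * 1000
profile[500:530] = [0.0] * 30
print(percent_less_structured(profile, 500, 530))
```

A value near 100 means the chosen window is less reactive, and by this measure more structured, than almost all random windows of the same length.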
In contrast to the other large HIV proteins, domains in gp120 (termed inner, outer, and bridging sheet) are not structurally autonomous. The C-terminal 35 residues of gp120 weave from the outer to the inner domain, and the bridging sheet is composed of residues that are 315 positions apart¹⁶. Junctions between domains in gp120 are also not encoded by highly structured RNA, suggesting that gp120 folding is not linked to RNA structure in the same way as for other HIV proteins because its constituent domains are not structurally independent.
The recurring pattern of structure, conspicuously located near or after autonomously folding protein coding domains, is consistent with a model in which HIV protein structure is encoded in its RNA at two distinct levels. The first is the linear relationship between RNA and protein primary sequences. In the second level, higher-order RNA structure directly encodes protein tertiary structure because unstructured protein loops are derived from highly structured RNA elements. Many proteins appear to fold during translation¹⁷, highly structured RNA slows translation and causes ribosomal pausing¹⁸,¹⁹, and changes in the extent of local RNA structure modulate protein activity²⁰. Together, these observations suggest that attenuation of ribosome elongation by highly structured RNA at protein domain junctions facilitates native folding of HIV proteins by allowing time for domains to fold independently during translation.
This model makes the clear prediction that ribosome pause sites should occur preferentially in the highly structured regions of an HIV-1 RNA that encode protein junctions. We tested this idea using a toeprinting experiment, in which ribosome processivity is inhibited by cycloheximide and sites preferentially occupied by the ribosome are detected in vitro as stops to primer extension. Ribosome pause sites are statistically overrepresented at the MA-CA and CA-NC junctions in Gag and at the sequences encoding the cyclophilin loop in CA (Fig. S5). Conversely, ribosome pause sites are underrepresented in flanking, but unstructured, regions of the HIV RNA (p = 0.018). These experiments thus strongly support the model that mRNA structure over a region spanning 60–100 nucleotides specifically modulates ribosome processivity at protein domain junctions.
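One simple way to ask whether pause sites are overrepresented in a set of regions is a one-sided binomial test, sketched below. The statistical test behind the reported p value is not named in this passage, so this is an assumed stand-in and not the authors' analysis. The null hypothesis here is that each pause site falls in the regions of interest with probability equal to the fraction of the transcript those regions cover. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_sites, n_in_region, region_fraction):
    """One-sided binomial tail P(X >= n_in_region) for X ~ Binomial(n_sites, region_fraction).

    n_sites         -- total number of pause sites observed
    n_in_region     -- number of those sites inside the regions of interest
    region_fraction -- fraction of the transcript length covered by those regions
    """
    return sum(
        comb(n_sites, k) * region_fraction ** k * (1 - region_fraction) ** (n_sites - k)
        for k in range(n_in_region, n_sites + 1)
    )


# Hypothetical counts for illustration only: 20 pause sites, 9 of them in
# regions that cover 20% of the transcript.
print(enrichment_p_value(20, 9, 0.2))
```

A small tail probability indicates more pause sites in the regions than uniform placement would predict. Testing for underrepresentation would use the lower tail instead.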