|Home | About | Journals | Submit | Contact Us | Français|
Evolution is an adaptive walk through a hypothetical fitness landscape, which depicts the relationship between genotypes and the fitness of each corresponding phenotype. We constructed an empirical fitness landscape for a catalytic RNA by combining next-generation sequencing, computational analysis and ‘serial depletion’, an in vitro selection protocol. By determining kobs for every point mutant of a catalytic RNA, we demonstrated that abundance in serially depleted pools correlates with biochemical activity (r = 0.67, Z = 7.4). Therefore, enumeration of each genotype by deep sequencing yielded a fitness landscape containing ~107 unique sequences, without requiring measurement of the phenotypic fitness for each sequence. High-throughput mapping between genotype and phenotype may apply to artificial selections, host-pathogen interactions and other biomedically relevant evolutionary phenomena.
In vitro selection of RNA has led to the isolation of aptamers and ribozymes expanding our understanding of the biochemical capabilities of this nucleic acid (1). SELEX methodology also allows the study of the process of evolutionary adaptation (2), which has been conceptualized as an optimizing walk through a fitness landscape (3). In molecular evolution, such a landscape represents the sequence space of a macromolecule and the fitness associated with each genotype (4, 5). Two major difficulties have been identified in the empirical construction of macromolecular fitness landscapes. First, even for macromolecules of modest length, the sequence space is vast; a 20-mer RNA or protein has ~1012 and ~1026 possible sequences, respectively. Second, in order to characterize the landscape, the phenotypic fitness of each individual genotype needs to be measured, or an indirect measure of fitness needs to be validated. Therefore, although fitness landscapes have been constructed through computer simulations [e.g. (6-8)], experimental analyses of fitness landscapes have typically been limited to dozens to hundreds of genotypes [reviewed in (9)].
Advances in DNA sequencing methodology allow us to sequence ~108 individual molecules of lengths up to ~50 nucleotides (nt) with high accuracy, producing larger accessible experimental sequence spaces (10). During an in vitro selection experiment, the representation of RNA molecules that are more active should increase. Thus, we hypothesize that the abundance (or ‘fecundity’) of a particular sequence in an RNA pool undergoing selection, or its rate of increase, may serve as a surrogate fitness metric. Thus, deep sequencing of earlier stages of an in vitro RNA selection experiment (before the pool has become dominated by the most active species) and enumeration of the frequency of each genotype may directly provide an experimental fitness landscape.
To establish the validity of this approach, we mutagenized a well-characterized in vitro selected RNA ligase ribozyme and subjected the resulting pool to reselection. We chose the ‘class II’ ligase (11) because its length (54 nt) is comparable to those of reliable individual reads by the Illumina genome analyzer, and because the ‘wild-type’ [isolate a4-11, hereafter ‘master sequence’, Fig. S1 and (11)] of this catalytic RNA is extremely active, likely being near-optimal. The mutagenized pool (45 nt at a degeneracy of 21% per position; expected frequency of master sequence = 0.0025%) can be analogous to a viral population arising due to error-prone replication (12). Our class II ligase construct catalyzes the formation of a 2′-5′ phosphodiester bond between its 5′ triphosphate and the 3′ terminus of a ‘substrate’ RNA that is immobilized on a magnetic bead [Fig. S1 and (13)]. Additional sequence pools served as negative controls. One was comprised of ~1013 random RNAs; the other of mutants of a sequence variant of the class II ligase that has been engineered to favor 3′-5′ bond formation, but which is crippled in catalytic activity [(14) and Fig. S1].
Our control pools were flanked by constant sequences allowing their members to be distinguished from those of the master sequence pool, and neither was expected to be significantly enriched during the course of selection. The three RNA pools were mixed to generate a starting pool with a complexity of ~ 6 × 1013 (13), with the genotypes of members of each pool centered on a characteristic sequence distance (number of mutations) from the master sequence (Fig. 1A).
To separate individual sequences by their reactivity, we incubated the RNA pool with substrate RNA linked to beads, depleting the most active species from the population. RNAs were allowed to react for 24 hours. Then, species capable of ligation to the beads were sequenced, and their genotypes and frequencies analyzed [Fig. 1, Table S1, and (13)]. This one-step selection experiment resulted in an enrichment of the master sequence pool at the detriment of the random and engineered pools. The most abundant sequences in the starting pool are PCR artifacts from the random pool (Fig. 1B). These disappear in the 24 hour selected pool (Fig. 1D). Moreover, the representation of the master sequence (Hamming distance = 0) increased markedly, from 0.0015% to 0.28% of the total population (Fig. 1C,E).
Since the enrichment of a sequence in the selection experiment should be proportional to its catalytic activity, we successively depleted an RNA pool with a stoichiometric excess of immobilized substrate for increasing time intervals in a ‘serial depletion’ experiment. Comparison of the population captured after a one minute incubation with that captured after 24 hours shows that the former is composed of sequences that are markedly closer to the master sequence (Fig. 2A). If the fecundity of genotypes is proportional to their activity, then the rate of change of their frequency over the time-course, plotted as a function of genotype space, would represent a fitness landscape (Figs. 2B, S2, and S3). Indeed, populations selected early were more active than those selected at the longest time-point, as evidenced by their bulk kobs (Fig. 3A). The populations selected at the three shortest time-points had comparable activity to each other, but the serially depleted population that was allowed to react for 24 hours was only one third as active. Consistent with early selection of the most active sequences, the information content (15, 16) of the early time-points is also highest [Fig. 3A and (13)].
To demonstrate experimentally that the frequency of each sequence is indicative of its catalytic activity and not a due to interactions within the RNA population, all 162 possible single point mutations of the master sequence were synthesized, and their kobs measured [Fig. S4, Table S3, and (13)]. The kobs of point mutants in all positions varied in the library (n = 135) correlated positively (r = 0.67) with the frequency of sequences containing the mutations in the 24 hour incubation (Fig. 3B). We tested the significance of this correlation through computation of the distribution of the correlation coefficients of randomly assorted experimental kobs and observed mutation frequencies [mean = 0.0088, s.d. = 0.089, Fig. 3B, Table S4, (17)]. This analysis indicates a significant regression coefficient (Z = 7.4, p << 0.0001, t-test). Therefore, the fecundity of any individual sequence provides a metric of its fitness, and we can create an experimental fitness landscape comprised of ~107 different RNA genotypes in a single experiment (Fig. 2B)
The empirical fitness landscape we generated is a high-dimensional object. We visualized it by computing the information content per residue of the master sequence, in essence projecting the landscape onto the ribozyme sequence. We plotted the information content for sequences most abundant in the 1 minute time-point (Dmax = 1 minute), and those most abundant in the 24 hour time-point (Dmax = 24 hours) on a revised secondary structure of the ribozyme generated from sequence covariation information implicit in our data [Figs. 4A,B, Tables S5–S9, Fig S5, (13)]. This analysis reveals that, in addition to the previously characterized key residues surrounding the ligation junction (11, 14), sequences in the central bulge of the RNA and the distal end of P3 are important for maximal activity. The location of these residues under purifying selection suggests that the class II ligase may fold into a Y-shaped structure (18) that brings the ligation junction and L3 into close proximity. The difference in information content between members of the slow- and fast-reacting populations (Fig. 4C) provided a measure of the slope of the landscape that the evolving population must traverse to achieve maximum fitness. As expected, the catalytic rates measured for each point mutation (Fig. S5D) support the functional importance of P1, P3 and L3.
In addition to studying the phenotypic fitness and positional information content of populations comprised of closely related sequences, our methodology (which produces orders of magnitude more sequence information than was practical before) could also be used to query the early rounds of SELEX experiments starting with random pools. Indeed, the locality-sensitive hash algorithm we implemented is more efficient in analyzing libraries with greater sequence diversity (19). A typical SELEX experiment employs multiple rounds of purifying selection and amplification to separate highly active sequences from other members of the starting population, assumed to be inactive (20-22). However, this assumption is only rarely tested (2). While aggregate activity of the population in earlier rounds of selection is low, there may be many sequences present that represent suboptimal solutions. Recent experiments demonstrated that minute RNAs, working alone or in ensembles, can be highly active catalysts (23, 24). For such small functional nucleic acids (≤ 13 nt) our methodology can determine the fitness of every possible RNA (413 or 6.7 × 107 possible genotypes).
One application of empirically generated fitness landscapes is the analysis of systems in genetic conflict (25); for example, between a host immune response and an infecting viral quasispecies (26, 27). In such systems, positive selection drives phenotypic variation that maximizes sampling of their respective fitness landscapes. However, these systems are dynamic, with changes in one population leading to changes in the other. By analyzing regions of the fitness landscapes that remain fixed between multiple conditions we predict that critical regions of the molecules in question can be identified for vaccine or drug development.
We thank R. Basom, A. Marty, and the staff of the Fred Hutchison Cancer Research Center Genomics Shared Resource for performing the Illumina sequencing, and D. Bartel, B. Fritz, R. Knight, H. Malik, W. Noble, H. Robins, A. Roll-Mecak, and M. Tewari, for discussions. This work was supported by the W.M. Keck Foundation and the Howard Hughes Medical Institute (HHMI). A.R.F. is an Investigator of the HHMI.
Accession codes: Illumina sequence data have been deposited with the NCBI Sequence Read Archive with accession code SRA020870.1.